Introduction
We recently announced BigQuery support for sparse features, which lets users store and process sparse data efficiently. This functionality enables users to represent sparse tensors and train machine learning models directly in the BigQuery environment. Being able to represent sparse tensors is useful because sparse tensors appear extensively in encoding schemes like TF-IDF during data pre-processing for NLP applications, and in pre-processing images with many dark pixels in computer vision applications.
Sparse features have numerous applications, such as text generation and sentiment analysis. In this blog, we'll demonstrate how to perform sentiment analysis with sparse features in BigQuery ML by training machine learning models and running inference on a public dataset. This blog also highlights how easy it is to work with unstructured text data in BigQuery, an environment traditionally used for structured data.
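For intuition, a sparse representation stores only the non-zero entries of a mostly-zero vector. This is not BigQuery code, just a minimal Python sketch of the idea:

```python
# A dense vector with mostly zeros, and its sparse (index, value) form.
dense = [0, 0, 3, 0, 0, 0, 5, 0]

# Keep only the non-zero entries as (index, value) pairs.
sparse = [(i, v) for i, v in enumerate(dense) if v != 0]
print(sparse)  # [(2, 3), (6, 5)]

# The dense vector can be reconstructed from the sparse form.
restored = [0] * len(dense)
for i, v in sparse:
    restored[i] = v
assert restored == dense
```

For vectors with thousands of dimensions and only a handful of non-zero values, this saves both storage and computation.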
Using sample IMDb dataset
Let’s say you want to conduct a sentiment analysis on movie reviews from the IMDb website. For readers who want to follow along, we will use the IMDb reviews dataset from the BigQuery public datasets. Let’s look at the top 2 rows of the dataset.

Although the reviews table has 7 columns, we only need the review and label columns to perform sentiment analysis in this case. We also consider only the 'Negative' and 'Positive' values in the label column. The following query selects just the required information from the dataset.
SELECT
  review,
  label
FROM
  `bigquery-public-data.imdb.reviews`
WHERE
  label IN ('Negative', 'Positive')
The top 2 rows of the result are as follows:

Methodology
Based on the dataset that we have, the following steps will be carried out:
- Build a vocabulary list using the review column
- Convert the review column into sparse tensors
- Train a classification model using the sparse tensors to predict the label (“positive” or “negative”)
- Make predictions on new test data to classify reviews as positive or negative.
Feature engineering
In this section, we convert the text in the review column into numerical features that we can feed to a machine learning model. One common approach is bag-of-words: build a vocabulary from the words in the reviews, then use the most common words to construct numerical features for model training. But first, we must extract the words from each review. The following code creates a dataset and a table containing row numbers and the words extracted from each review.
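The SQL below uses REGEXP_EXTRACT_ALL to pull out lowercase words of two or more letters. The same extraction can be sketched in plain Python to see what the regex produces (illustrative only, not part of the BigQuery pipeline):

```python
import re

def extract_words(review: str) -> list:
    # Mirror REGEXP_EXTRACT_ALL(LOWER(review), '[a-z]{2,}'):
    # lowercase the text, then keep runs of 2+ ASCII letters.
    return re.findall(r"[a-z]{2,}", review.lower())

words = extract_words("A great movie, 10/10!")
print(words)  # ['great', 'movie']
```

Note that single-letter tokens (like "a") and digits are dropped by the `{2,}` quantifier and the `[a-z]` character class.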
-- Create a dataset named `sparse_features_demo` if it doesn't exist
CREATE SCHEMA IF NOT EXISTS sparse_features_demo;

-- Select unique reviews with only negative and positive labels
CREATE OR REPLACE TABLE sparse_features_demo.processed_reviews AS (
  SELECT
    ROW_NUMBER() OVER () AS review_number,
    review,
    REGEXP_EXTRACT_ALL(LOWER(review), '[a-z]{2,}') AS words,
    label,
    split
  FROM (
    SELECT DISTINCT
      review,
      label,
      split
    FROM
      `bigquery-public-data.imdb.reviews`
    WHERE
      label IN ('Negative', 'Positive')
  )
);
The output table from the query above should look like this:

The next step is to build a vocabulary from the extracted words. The following code creates a vocabulary table with each word's frequency and index. To reduce computation time, we keep only the top 20,000 words.
-- Create a vocabulary using the train dataset and keep only the
-- top 20,000 words based on frequency
CREATE OR REPLACE TABLE sparse_features_demo.vocabulary AS (
  SELECT
    word,
    word_frequency,
    word_index
  FROM (
    SELECT
      word,
      word_frequency,
      ROW_NUMBER() OVER (ORDER BY word_frequency DESC) - 1 AS word_index
    FROM (
      SELECT
        word,
        COUNT(word) AS word_frequency
      FROM
        sparse_features_demo.processed_reviews,
        UNNEST(words) AS word
      WHERE
        split = "train"
      GROUP BY
        word
    )
  )
  WHERE
    word_index < 20000  -- Select top 20,000 words based on word count
);
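The vocabulary logic above (count words across reviews, order by descending frequency, assign zero-based indices, keep the top N) can be sketched in plain Python for intuition:

```python
from collections import Counter

def build_vocabulary(reviews_words, top_n=20000):
    """Count words across all reviews and assign each word a
    zero-based index, ordered by descending frequency."""
    counts = Counter(w for words in reviews_words for w in words)
    # most_common() sorts by frequency, mirroring ORDER BY word_frequency DESC.
    return {
        word: (freq, index)
        for index, (word, freq) in enumerate(counts.most_common(top_n))
    }

vocab = build_vocabulary([["the", "movie", "the"], ["movie", "was", "the"]])
print(vocab)  # {'the': (3, 0), 'movie': (2, 1), 'was': (1, 2)}
```

The most frequent word gets index 0, matching the `ROW_NUMBER() OVER (ORDER BY word_frequency DESC) - 1` expression in the SQL.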
The following shows the top 10 words based on frequency and their respective index from the resulting table of the query above.

Creating a sparse feature
Now we will use the extracted words and the vocabulary to create a sparse feature in BigQuery. For each review, we aggregate the word_index and word_frequency values, which produces a column of type ARRAY<STRUCT<INT64, INT64>>. Each review is now represented as an array of (word_index, word_frequency) pairs.
-- Generate a sparse feature by aggregating word_index and word_frequency in each review
CREATE OR REPLACE TABLE sparse_features_demo.sparse_feature AS (
  SELECT
    review_number,
    review,
    ARRAY_AGG(STRUCT(word_index, word_frequency)) AS feature,
    label,
    split
  FROM (
    SELECT DISTINCT
      review_number,
      review,
      word,
      label,
      split
    FROM
      sparse_features_demo.processed_reviews,
      UNNEST(words) AS word
    WHERE
      word IN (SELECT word FROM sparse_features_demo.vocabulary)
  ) AS word_list
  LEFT JOIN
    sparse_features_demo.vocabulary AS topk_words
  ON
    word_list.word = topk_words.word
  GROUP BY
    review_number,
    review,
    label,
    split
);
Once the query executes, a sparse feature named `feature` is created. The `feature` column is an ARRAY of STRUCTs, each made up of word_index and word_frequency. The picture below displays the resulting table at a glance.

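Putting the pieces together, each review becomes a list of (word_index, word_frequency) pairs: the sparse equivalent of a bag-of-words vector. A rough Python analogue of the aggregation above (note that, as in the SQL, word_frequency comes from the vocabulary table, i.e. it is the corpus-wide count joined onto each distinct word):

```python
def to_sparse_feature(words, vocab):
    """Map a review's words to (word_index, word_frequency) pairs,
    keeping only distinct words found in the vocabulary.
    vocab maps word -> (corpus_frequency, index)."""
    present = set(words)
    return sorted(
        (index, freq)
        for word, (freq, index) in vocab.items()
        if word in present
    )

vocab = {"the": (3, 0), "movie": (2, 1), "was": (1, 2)}
print(to_sparse_feature(["the", "movie", "the"], vocab))  # [(0, 3), (1, 2)]
```

Words outside the vocabulary simply produce no entry, which is what makes the representation sparse.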
Training a BigQuery ML model
We just created a dataset with a sparse feature in BigQuery. Let’s see how to use that dataset to train a machine learning model with BigQuery ML. In the following query, we train a logistic regression model to predict the label; the TRANSFORM clause excludes review_number and review so that only the feature column is used for training:
-- Train a logistic regression classifier using the data with the sparse feature
CREATE OR REPLACE MODEL sparse_features_demo.logistic_reg_classifier
  TRANSFORM (
    * EXCEPT (
      review_number,
      review
    )
  )
  OPTIONS (
    MODEL_TYPE = 'LOGISTIC_REG',
    INPUT_LABEL_COLS = ['label']
  ) AS
SELECT
  review_number,
  review,
  feature,
  label
FROM
  sparse_features_demo.sparse_feature
WHERE
  split = "train";
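Under the hood, a logistic regression model scores a sparse feature by multiplying each present (index, value) pair with the corresponding learned weight, so absent words contribute nothing. A minimal sketch of that scoring step (the weight values here are made up for illustration, not taken from the trained model):

```python
import math

def predict_proba(sparse_feature, weights, bias=0.0):
    """Score a sparse (index, value) feature with a linear model:
    only indices that are present contribute to the dot product."""
    score = bias + sum(weights.get(i, 0.0) * v for i, v in sparse_feature)
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid

# Hypothetical weights: index 0 pushes positive, index 1 pushes negative.
weights = {0: 0.8, 1: -1.2}
p = predict_proba([(0, 3), (1, 2)], weights)
print(round(p, 3))  # sigmoid(0.8*3 - 1.2*2) = sigmoid(0.0) = 0.5
```

This is why the sparse representation is efficient for training: the dot product only touches the handful of non-zero entries instead of all 20,000 vocabulary dimensions.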
Now that we have trained a BigQuery ML model using a sparse feature, we can evaluate it and tune as needed.
-- Evaluate the trained logistic regression classifier
SELECT * FROM ML.EVALUATE(MODEL sparse_features_demo.logistic_reg_classifier);

The score looks like a decent starting point, so let’s go ahead and test the model with the test dataset.
-- Evaluate the trained logistic regression classifier using test data
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL sparse_features_demo.logistic_reg_classifier,
    (
      SELECT
        review_number,
        review,
        feature,
        label
      FROM
        sparse_features_demo.sparse_feature
      WHERE
        split = "test"
    )
  );

The model's performance on the test dataset looks satisfactory, so it can now be used for inference. One thing to note: since the model was trained on numerical features, it will only accept numerical features as input. Hence, new reviews must go through the same transformation steps before they can be used for inference. The next step shows how to apply that transformation to a user-defined dataset.
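Conceptually, the whole preprocessing for a new review (lowercase and extract words, look them up in the vocabulary, emit the pairs) can be sketched end to end in Python; the toy vocabulary here is made up for illustration:

```python
import re

# Hypothetical vocabulary: word -> (corpus_frequency, word_index)
vocab = {"boring": (10, 4), "movie": (50, 1), "best": (12, 3)}

def preprocess(review):
    # Same steps as the SQL pipeline: lowercase, extract 2+ letter words,
    # keep distinct vocabulary words, map to (word_index, word_frequency).
    words = set(re.findall(r"[a-z]{2,}", review.lower()))
    return sorted((idx, freq) for w, (freq, idx) in vocab.items() if w in words)

print(preprocess("What a boring movie"))  # [(1, 50), (4, 10)]
```

Any word not seen during vocabulary construction ("what" in this toy example) is silently dropped, exactly as the IN-vocabulary filter does in the SQL below.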
Sentiment predictions from the BigQuery ML model
All that's left now is to create a user-defined dataset, apply the same transformations to the reviews, and use the resulting sparse features for model inference. This can be achieved with a WITH statement, as shown below.
WITH
  -- Create user-defined reviews
  user_defined_reviews AS (
    SELECT
      ROW_NUMBER() OVER () AS review_number,
      review,
      REGEXP_EXTRACT_ALL(LOWER(review), '[a-z]{2,}') AS words
    FROM (
      SELECT "What a boring movie" AS review UNION ALL
      SELECT "I don't like this movie" AS review UNION ALL
      SELECT "The best movie ever" AS review
    )
  ),
  -- Create a sparse feature from the user-defined reviews
  user_defined_sparse_feature AS (
    SELECT
      review_number,
      review,
      ARRAY_AGG(STRUCT(word_index, word_frequency)) AS feature
    FROM (
      SELECT DISTINCT
        review_number,
        review,
        word
      FROM
        user_defined_reviews,
        UNNEST(words) AS word
      WHERE
        word IN (SELECT word FROM sparse_features_demo.vocabulary)
    ) AS word_list
    LEFT JOIN
      sparse_features_demo.vocabulary AS topk_words
    ON
      word_list.word = topk_words.word
    GROUP BY
      review_number,
      review
  )
-- Make predictions on the user-defined data using the trained model
SELECT
  review,
  predicted_label
FROM
  ML.PREDICT(
    MODEL sparse_features_demo.logistic_reg_classifier,
    (SELECT * FROM user_defined_sparse_feature)
  );
Here is what you would get from executing the query above:

And that’s it! We just performed sentiment analysis on the IMDb dataset from the BigQuery public datasets using only SQL statements and BigQuery ML. Now that we have demonstrated how sparse features can be used with BigQuery ML models, we can’t wait to see all the amazing projects you create by harnessing this functionality.
If you’re just getting started with BigQuery, check out our interactive tutorial to begin exploring.
By: Thu Ya Kyaw (Developer Advocate, Google Cloud) and Xiaoqiu Huang (Software Engineer, Google Cloud)
Originally published at Google Cloud Blog