We all love movies, right? Well, most of us do. Irrespective of language, geography, or culture, we enjoy not only watching movies but also talking about the nuances and qualities that make a movie successful. I have often wondered, “If only I could alter a few aspects and make a meaningful difference in the outcome, in terms of the movie’s rating or success factor.” That would involve predicting the movie’s success score so that I could play around with the variables, dialing values up and down to impact the result. That is exactly what we have done in this experiment.
Summary of architecture
Today we’ll predict a movie score using Vertex AI AutoML and store the result transactionally in MongoDB Atlas. The model is trained with data stored in BigQuery and registered in Vertex AI. The services involved can be grouped into three sections:
1. ML Model Creation
2. User Interface / Client Application
3. Trigger to predict using the ML API
ML Model Creation
1. Data sourced from CSV to BigQuery
2. BigQuery data integrated into Vertex AI for AutoML model creation
3. Model deployed in Vertex AI Model Registry for generating endpoint API
User Interface Application
4. MongoDB Atlas for storing transactional data and powering the client application
5. Angular client application interacting with MongoDB Atlas
6. Client container deployed in Cloud Run
Trigger to predict using the ML API
7. Java Cloud Functions to trigger invocation of the deployed AutoML model’s endpoint. The function takes movie details as a request from the UI, returns the predicted movie SCORE, and writes the response back to MongoDB
High-level overview of the architecture
Preparing training data
You can use any publicly available dataset, create your own, or use the CSV dataset in git. I have already applied basic preprocessing to the linked dataset for this experiment; feel free to do more elaborate cleansing and preprocessing for your implementation. The independent variables in the dataset (matching the schema used in the bq load step later) are:
Name (String)
Rating (String, Categorical)
Genre (String, Categorical)
Year (Numeric)
Released (String)
Director (String)
Writer (String)
Star (String)
Country (String, Categorical)
Budget (Numeric)
Company (String)
Runtime (Numeric)
The dependent variable is score, and the data_cat column tags each row’s data split category.
BigQuery dataset using Cloud Shell
BigQuery is a serverless, multi-cloud data warehouse that can scale from bytes to petabytes with zero operational overhead. This makes it a great choice for storing ML training data. But there’s more — the built-in machine learning (ML) and analytics capabilities allow you to create no-code predictions using just SQL queries. And you can access data from external sources with federated queries, eliminating the need for complicated ETL pipelines. You can read more about everything BigQuery has to offer on the BigQuery product page.
BigQuery allows you to focus on analyzing data to find meaningful insights. In this blog, you’ll use the bq command-line tool to load a local CSV file into a new BigQuery table. Follow the steps below to enable BigQuery:
Activate Cloud Shell and create your project
You will use Cloud Shell, a command-line environment running in Google Cloud. Cloud Shell comes pre-loaded with bq.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
- Navigate to BigQuery to enable the API. You can also open the BigQuery web UI directly by entering the following URL in your browser: https://console.cloud.google.com/bigquery
- From the Cloud Console, click Activate Cloud Shell. Make sure you navigate to the project and that it’s authenticated. Refer to gcloud config commands.
Creating and loading the dataset
A BigQuery dataset is a collection of tables. All tables in a dataset are stored in the same data location. You can also attach custom access controls to limit access to a dataset and its tables.
1. In Cloud Shell, use the bq mk command to create a dataset called “movies.”
```bash
bq mk --location=<<LOCATION>> movies
```
2. Clone the repository and move into the project directory:
```bash
git clone <<repository link>>
cd movie-score
```
3. Load the CSV file into a BigQuery table with the bq load command:
```bash
bq load --source_format=CSV --skip_leading_rows=1 movies.movies_score \
  ./movies_bq_src.csv \
  Id:numeric,name:string,rating:string,genre:string,year:numeric,released:string,score:string,director:string,writer:string,star:string,country:string,budget:numeric,company:string,runtime:numeric,data_cat:string
```
The last argument of bq load is the schema, which can be defined in a JSON schema file or as a comma-separated list. (I’ve used a comma-separated list.)
Hurray! Our CSV data is now loaded in the table movies.movies_score. Remember, you can create a view to keep only the essential columns that contribute to the model training and ignore the rest.
4. Let’s query it, quick!
We can interact with BigQuery in three ways:
- BigQuery web UI
- The bq command-line tool
- BigQuery client libraries and the REST API
Your queries can also join your data against any dataset (or datasets, so long as they’re in the same location) that you have permission to read. Find a snippet of the sample data below:
SELECT name, rating, genre, runtime FROM movies.movies_score limit 3;
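If you would rather run the same query programmatically, here is a minimal sketch using the google-cloud-bigquery Java client library. This class is an illustrative addition, not part of the experiment’s repository, and it assumes Application Default Credentials are configured in your environment:
```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class QueryMovies {
  public static void main(String[] args) throws InterruptedException {
    // Uses Application Default Credentials and the current project.
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
        "SELECT name, rating, genre, runtime FROM movies.movies_score LIMIT 3").build();
    TableResult result = bigquery.query(query);
    // Print one line per row of the sample data.
    result.iterateAll().forEach(row ->
        System.out.printf("%s | %s | %s | %s%n",
            row.get("name").getStringValue(),
            row.get("rating").getStringValue(),
            row.get("genre").getStringValue(),
            row.get("runtime").getValue()));
  }
}
```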
Predicting movie success score (user score on a scale of 1-10)
In this experiment, I predict the success score (user score/rating) for a movie as a multi-class classification problem on the movie dataset.
A quick note about the choice of model
The model here is an experimental choice. I initially evaluated results across a few models and finally went ahead with logistic regression, both to keep things simple and because its results were closer to the actual movie ratings from several databases. Please note that this should be considered just a sample implementation of the model and is definitely not the recommended model for this use case. One other way of implementing this is to predict the outcome of the movie as GOOD/BAD using the logistic regression model instead of predicting the score.
Using BigQuery data in Vertex AI AutoML integration
Use your data from BigQuery to directly create an AutoML model with Vertex AI. Remember, we can also perform AutoML from BigQuery itself, register the model with Vertex AI, and expose the endpoint. Refer to the documentation for BigQuery AutoML. In this example, however, we will use Vertex AI AutoML to create our model.
Creating a Vertex AI dataset
Go to Vertex AI from the Google Cloud Console, enable the Vertex AI API if not already done, expand the Data section and select Datasets, click Create dataset, select the TABULAR data type and the “Regression / classification” option, and click Create:
The BigQuery instance and the Vertex AI dataset must be in the same region for the BigQuery table to show up in Vertex AI.
Serverless web application with MongoDB Atlas and Angular
The user interface for this experiment is built with Angular, backed by MongoDB Atlas, and deployed on Cloud Run. Check out the blog post describing how to set up a MongoDB serverless instance to use in a web app and deploy that on Cloud Run.
In the application, we also utilize Atlas Search, a full-text search capability integrated into MongoDB Atlas. Atlas Search enables autocomplete when entering information about our movies. For the data, we imported the same dataset we used earlier into Atlas.
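The client itself is an Angular app, but to show the shape of an Atlas Search autocomplete query, here is a hedged sketch using the MongoDB Java driver and the $search aggregation stage. The index name “default” and the “name” field path are assumptions about the Atlas Search configuration, not values taken from the original application:
```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.List;

public class AutocompleteMovies {
  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("<<YOUR_CONNECTION_STRING>>")) {
      MongoCollection<Document> movies =
          client.getDatabase("movies").getCollection("movies");
      // $search with the autocomplete operator; assumes an Atlas Search index
      // named "default" with an autocomplete mapping on the "name" field.
      List<Document> pipeline = List.of(
          new Document("$search", new Document("index", "default")
              .append("autocomplete", new Document("query", "god")
                  .append("path", "name"))),
          new Document("$limit", 5),
          new Document("$project", new Document("name", 1).append("_id", 0)));
      movies.aggregate(pipeline).forEach(doc -> System.out.println(doc.toJson()));
    }
  }
}
```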
MongoDB Atlas for transactional data
In this experiment, MongoDB Atlas is used to record transactions in the form of the following (see the sketch after this list):
- Real-time user requests.
- Prediction result response.
- Historical data to facilitate UI fields autocompletion.
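For illustration, recording one such transaction with the MongoDB Java driver could look like the sketch below. The document fields here are hypothetical and only meant to show the request-plus-response shape; match them to your own schema:
```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.Date;

public class RecordPrediction {
  public static void main(String[] args) {
    // Assumption: connection string, database, and collection from your Atlas setup.
    try (MongoClient client = MongoClients.create("<<YOUR_CONNECTION_STRING>>")) {
      MongoCollection<Document> collection =
          client.getDatabase("movies").getCollection("movies");
      // Hypothetical document combining the user request and the prediction response.
      Document transaction = new Document("genre", "Drama")
          .append("runtime", 120)
          .append("country", "United States")
          .append("predictedScore", "8")
          .append("createdAt", new Date());
      collection.insertOne(transaction);
    }
  }
}
```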
If, instead, you want to configure a pipeline for streaming data from MongoDB to BigQuery and vice versa, check out the dedicated Dataflow templates.
Once you provision your cluster and set up your database, make sure to note the following in preparation for our next step, creating the trigger:
- Connection String
- Database Name
- Collection Name
Please note that the client application calls the Cloud Functions endpoint (explained in the section below), which takes the user input, predicts the movie score, and inserts the result into MongoDB.
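From the client’s perspective, that call is just an HTTPS POST. Here is a minimal sketch using Java’s built-in HttpClient; the function URL and the JSON body are placeholders for your own deployment, not values from the original project:
```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CallScoreFunction {
  public static void main(String[] args) throws Exception {
    // Hypothetical function URL and request body; match them to your deployment.
    String functionUrl = "https://<<region>>-<<project>>.cloudfunctions.net/movie-score";
    String body = "{\"genre\": \"Drama\", \"runtime\": \"120\", \"country\": \"United States\"}";
    HttpClient http = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(URI.create(functionUrl))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println("Predicted score response: " + response.body());
  }
}
```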
Java Cloud Function to trigger ML invocation from the UI
Cloud Functions is a lightweight, serverless compute solution that lets developers create single-purpose, stand-alone functions that respond to Cloud events without needing to manage a server or runtime environment. In this section, we will prepare the Java Cloud Functions code and dependencies and authorize the function to be executed by triggers.
Remember how we have the endpoint and other details from the ML deployment step? We are going to use them here, and since we are using Java Cloud Functions, we will use pom.xml to handle dependencies. We use the google-cloud-aiplatform library to consume the Vertex AI AutoML endpoint API:
```xml
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-aiplatform</artifactId>
  <version>3.1.0</version>
</dependency>
```
If you are using Gen2 (recommended), you can use the class name and package as-is. If you use Gen1 Cloud Functions, please change the package name and class name to “Example.” In the .java file, you will notice the part where we connect to the MongoDB instance to write data (use your credentials):
```java
MongoClient client = MongoClients.create(YOUR_CONNECTION_STRING);
MongoDatabase database = client.getDatabase("movies");
MongoCollection<Document> collection = database.getCollection("movies");
```
```java
PredictionServiceSettings predictionServiceSettings =
    PredictionServiceSettings.newBuilder()
        .setEndpoint("<<location>>-aiplatform.googleapis.com:443")
        .build();
int cls = 0;
…
EndpointName endpointName = EndpointName.of(project, location, endpointId);
```
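To put that excerpt in context, a complete prediction call with the google-cloud-aiplatform client can look like the following sketch. The project, location, endpoint ID, and the feature values in the instance JSON are assumptions you would replace with your own deployment’s details:
```java
import com.google.cloud.aiplatform.v1.EndpointName;
import com.google.cloud.aiplatform.v1.PredictResponse;
import com.google.cloud.aiplatform.v1.PredictionServiceClient;
import com.google.cloud.aiplatform.v1.PredictionServiceSettings;
import com.google.protobuf.Value;
import com.google.protobuf.util.JsonFormat;
import java.util.List;

public class MovieScorePredictor {
  public static void main(String[] args) throws Exception {
    String project = "<<your-project-id>>";     // assumption: your project ID
    String location = "us-central1";            // assumption: your endpoint's region
    String endpointId = "<<your-endpoint-id>>"; // from the Vertex AI Model Registry

    PredictionServiceSettings settings = PredictionServiceSettings.newBuilder()
        .setEndpoint(location + "-aiplatform.googleapis.com:443")
        .build();
    try (PredictionServiceClient client = PredictionServiceClient.create(settings)) {
      EndpointName endpointName = EndpointName.of(project, location, endpointId);
      // One tabular instance whose keys mirror the training columns (hypothetical values).
      Value.Builder instance = Value.newBuilder();
      JsonFormat.parser().merge(
          "{\"genre\": \"Drama\", \"country\": \"United States\", \"runtime\": \"120\"}",
          instance);
      PredictResponse response = client.predict(
          endpointName, List.of(instance.build()), Value.getDefaultInstance());
      // Each prediction carries the predicted class(es) and confidence scores.
      response.getPredictionsList().forEach(System.out::println);
    }
  }
}
```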
That’s it! Nothing else to do in this section. The client application uses this endpoint to send the user’s parameters to Cloud Functions as a request and receives the predicted movie score as the response. The function also writes the request and response to the MongoDB collection.
Thank you for following us on this journey! As a reward for your patience, you can check out the predicted score for your favorite movie. If you want to take the experiment further, here are a few ideas:
- Analyze and compare the accuracy and other evaluation metrics between a BigQuery ML model created manually with SQL and the Vertex AI AutoML model.
- Play around with the independent variables and try to increase the accuracy of the prediction result.
- Take it one step further and try the same problem as a linear regression model by predicting the score as a floating-point value instead of a rounded integer.
By: Abirami Sukumaran (Developer Advocate, Google) and Stanimira Vlaeva (MongoDB)
Source: Google Cloud Blog