Recommendation Systems with Spark on Google DataProc
Recommendation Engines. Spark. Cloud Infrastructure. Big Data.
Feeling overwhelmed with trendy buzzwords yet?
In this tutorial, you’ll learn how to build a movie recommendation system with Spark, using PySpark.
The tutorial focuses more on deployment than on code. We’ll deploy the project on Google’s Cloud Infrastructure using:
- Google Cloud DataProc
- Google Cloud Storage
- Google Cloud SQL
Table of Contents:
- Getting the Data
- Storing the Data
- Training the Model
- Deploying to Cloud DataProc
Getting the data:
You’ll first need to download the dataset we’ll be working with. You can access the small version of the MovieLens dataset here (1MB). Once your pipeline works, you can run it against the full dataset here (224MB).
Within this dataset, you’ll use the ratings.csv and movies.csv files. The first line of each file is a header row with the column names. Remove that line from both files before loading the data into Cloud SQL.
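One way to drop the header rows is a few lines of Python. This is a minimal sketch, not part of the original project: the helper name strip_header and the output file names are hypothetical, and it writes a new file rather than editing in place so the original download stays intact.

```python
def strip_header(src_path, dst_path):
    """Copy src_path to dst_path, skipping the first (header) line."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        next(src, None)  # discard the header row; does nothing on an empty file
        for line in src:
            dst.write(line)

# Example usage (hypothetical output names):
# strip_header("movies.csv", "movies_noheader.csv")
# strip_header("ratings.csv", "ratings_noheader.csv")
```

If you go this route, upload the stripped files (or rename them back to movies.csv and ratings.csv) so the later steps pick up the header-free versions.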
Storing the data:
Google Cloud SQL:
Google Cloud SQL is a service that makes it easy to set up, maintain, manage and administer your relational MySQL databases in the cloud. Hosted on Google Cloud Platform, Cloud SQL provides a database infrastructure for applications running anywhere.
You’ll need to write a few SQL scripts that create the database and tables.
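As a rough sketch of what those scripts might contain, the Python below writes four .sql files to disk. The file names follow the import order used later in this tutorial and the database name TEST comes from the create_db script. The column definitions are assumptions based on the MovieLens CSV layout (movieId,title,genres and userId,movieId,rating,timestamp), and the Recommendations schema is hypothetical. Adjust them to match your own scripts.

```python
from pathlib import Path

# Import order matters: Ratings and Recommendations reference Movies.
IMPORT_ORDER = ["create_db.sql", "movies.sql", "ratings.sql", "recommendations.sql"]

# Assumed MySQL schemas; column order mirrors the MovieLens CSV files.
SCRIPTS = {
    "create_db.sql": "CREATE DATABASE IF NOT EXISTS TEST;\n",
    "movies.sql": """\
CREATE TABLE Movies (
  movieId INT NOT NULL,
  title   VARCHAR(255) NOT NULL,
  genres  VARCHAR(255),
  PRIMARY KEY (movieId)
);
""",
    "ratings.sql": """\
CREATE TABLE Ratings (
  userId      INT NOT NULL,
  movieId     INT NOT NULL,
  rating      FLOAT NOT NULL,
  `timestamp` INT,
  FOREIGN KEY (movieId) REFERENCES Movies (movieId)
);
""",
    "recommendations.sql": """\
CREATE TABLE Recommendations (
  userId     INT NOT NULL,
  movieId    INT NOT NULL,
  prediction FLOAT,
  FOREIGN KEY (movieId) REFERENCES Movies (movieId)
);
""",
}

def write_scripts(out_dir="."):
    """Write each script into out_dir and return the paths in import order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in IMPORT_ORDER:
        path = out / name
        path.write_text(SCRIPTS[name], encoding="utf-8")
        paths.append(path)
    return paths
```

Calling write_scripts() in your project folder produces the four files. The table scripts deliberately omit a USE statement, since the database is selected at import time.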
Load SQL scripts and data into Cloud Storage:
Create a bucket and upload the scripts to Google Cloud Storage. Bucket names are globally unique, so replace movie-rec-tutorial below with a name of your choosing. Using the Google Cloud SDK from a terminal:
$ gsutil mb gs://movie-rec-tutorial
$ gsutil cp *.sql gs://movie-rec-tutorial
$ gsutil cp movies.csv ratings.csv gs://movie-rec-tutorial
(If you haven’t set up the Google Cloud SDK yet, you can install it with Homebrew:)
$ brew install --cask google-cloud-sdk
After this step, you can look in Google Cloud Storage and confirm the files were uploaded successfully.
Next, you’ll create your Cloud SQL instance. Select Second Generation. (I’ve disabled backups because they aren’t necessary for a tutorial project; you may choose otherwise):
After initializing the instance, set a new root password (instance -> access control -> users):
Import the scripts you uploaded to Cloud Storage into the SQL instance to create the database and tables (Import -> nameOfYourStorage -> Scripts). Because of foreign key constraints, run the scripts in this order: create_db, movies, ratings, recommendations.
When loading the table scripts, you’ll need to specify the database in the advanced options. The create_db file creates a DB with name “TEST”, referred to here.
Load the data:
Take the csv files (movies.csv, ratings.csv) and load them into the Movies and Ratings tables, respectively.
Open the Network:
To connect to the instance from the cluster, you’ll need to authorize the IP addresses of the cluster nodes. For simplicity, this tutorial opens the instance to traffic from any address.
NOTE: This should not be done in production. Please look into Cloud SQL Proxy for production-grade security.
If you’d like to confirm that everything up to this point worked, connect to your instance with your favorite MySQL client. I prefer Sequel Pro.
You’ll need to take the IPv4 address of your SQL instance found here:
Training the model:
You’ve loaded the data into Cloud SQL; it’s time to train the model. With the code below, you’ll connect to your Cloud SQL instance, read in the tables, train the model, and then generate predictions and write them to the recommendations table.
Walking through the code:
The first few lines configure Spark, setting the app name and the memory limits for the driver and executors. Information on additional properties can be found in the Spark Conf Docs.
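A configuration step along these lines might look like the sketch below. The app name and memory values are placeholders rather than the settings from engine.py, and it uses SparkSession.builder, which may differ from how the original script sets things up. The PySpark import is deferred so the module loads even where PySpark isn’t installed.

```python
# Placeholder values; tune them to the machine types in your cluster.
SPARK_SETTINGS = {
    "spark.driver.memory": "4g",
    "spark.executor.memory": "4g",
}

def build_spark_session(app_name="movie-recommender"):
    """Create (or reuse) a SparkSession with the settings above."""
    from pyspark.sql import SparkSession  # deferred: only needed at run time

    builder = SparkSession.builder.appName(app_name)
    for key, value in SPARK_SETTINGS.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

One caveat from the Spark documentation: in client mode the driver JVM has already started by the time application code runs, so spark.driver.memory set this way may not take effect and is better supplied as a job property at submit time.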
After configuring Spark, the code builds a URL for the database connection from the values supplied in sys.argv. This URL is used to read the tables from the Cloud SQL instance.
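Here is one way that step could be written. It is a sketch under a few assumptions: the arguments arrive in the order IP, username, password; the database is TEST on MySQL’s default port 3306; and the credentials are passed as JDBC properties instead of being embedded in the URL. The function name jdbc_settings is hypothetical, and the driver class shown is the Connector/J 5.x name, so it may need changing for the connector version on your cluster.

```python
import sys

def jdbc_settings(argv, database="TEST", port=3306):
    """Return (url, properties) for Spark's JDBC reader from CLI arguments."""
    if len(argv) < 4:
        raise ValueError("usage: engine.py <cloudsql-ip> <user> <password>")
    host, user, password = argv[1:4]
    url = "jdbc:mysql://{}:{}/{}".format(host, port, database)
    properties = {
        "user": user,
        "password": password,
        # Assumption: Connector/J 5.x; 8.x uses com.mysql.cj.jdbc.Driver.
        "driver": "com.mysql.jdbc.Driver",
    }
    return url, properties

# Inside the Spark job (spark is an existing SparkSession):
# url, props = jdbc_settings(sys.argv)
# ratings = spark.read.jdbc(url=url, table="Ratings", properties=props)
# movies = spark.read.jdbc(url=url, table="Movies", properties=props)
```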
Next come the hyperparameters supplied to the ALS algorithm. I have supplied default parameters that perform well for this dataset. More information can be found in the Spark ALS Docs.
Finally, the model is trained with the ratings dataframe. After training, top 10 predictions are generated and written back to the CloudSQL instance.
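To make the training and prediction steps concrete, here is a minimal sketch using the DataFrame-based ALS in pyspark.ml (Spark 2.2 or later). It is not necessarily how engine.py does it: the hyperparameter values are illustrative rather than the tutorial’s tuned defaults, the column names assume the MovieLens layout, and "top 10" is interpreted as the top 10 movies per user.

```python
# Illustrative values only; not the defaults shipped in engine.py.
ALS_PARAMS = {"rank": 10, "maxIter": 10, "regParam": 0.1}

def train_and_recommend(ratings_df, top_n=10):
    """Fit ALS on a ratings DataFrame; return one row per (user, movie) pick."""
    # Deferred imports so this module loads without PySpark installed.
    from pyspark.ml.recommendation import ALS
    from pyspark.sql.functions import col, explode

    als = ALS(
        userCol="userId",
        itemCol="movieId",
        ratingCol="rating",
        coldStartStrategy="drop",  # skip users/items unseen during training
        **ALS_PARAMS
    )
    model = als.fit(ratings_df)

    # One row per user, with an array of (movieId, rating) structs.
    per_user = model.recommendForAllUsers(top_n)

    # Flatten the array into plain rows that fit a relational table.
    return (
        per_user
        .select("userId", explode("recommendations").alias("rec"))
        .select(
            "userId",
            col("rec.movieId").alias("movieId"),
            col("rec.rating").alias("prediction"),
        )
    )

# Writing back (url/props are the JDBC URL and credentials, table name assumed):
# recs = train_and_recommend(ratings)
# recs.write.jdbc(url=url, table="Recommendations", mode="append", properties=props)
```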
Copy this file to your Google Cloud Storage bucket:
$ gsutil cp engine.py gs://movie-rec-tutorial
Deploying to Google Cloud DataProc:
DataProc is a managed Hadoop and Spark service. Here it is used to execute the engine.py file over a cluster of Compute Engine nodes.
You’ll need to enable the API in the API Manager:
Next, head over to DataProc and configure a cluster:
After configuring the cluster and waiting for every node to initialize, create a job to execute engine.py. You’ll need to pass your Cloud SQL instance IP, username, and password as command-line arguments, which the script reads from sys.argv (seen below).
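If you’d rather submit the job from a terminal than from the console, the same job can be expressed as a gcloud command. The sketch below only assembles the argument list; the cluster name, region, and credentials are placeholders you would replace, and the helper name submit_command is hypothetical.

```python
import shlex

def submit_command(cluster, region, bucket, sql_ip, sql_user, sql_password):
    """Build the gcloud invocation that submits engine.py as a PySpark job."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        "gs://{}/engine.py".format(bucket),
        "--cluster={}".format(cluster),
        "--region={}".format(region),
        "--",  # everything after this is handed to engine.py as sys.argv[1:]
        sql_ip, sql_user, sql_password,
    ]

def as_shell(cmd):
    """Render the argument list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(part) for part in cmd)

# Example with placeholder values:
# cmd = submit_command("my-cluster", "us-central1", "movie-rec-tutorial",
#                      "203.0.113.10", "root", "my-password")
# print(as_shell(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the password on the command line is fine for a throwaway tutorial instance, but it will show up in job metadata, which is one more reason to move to the Cloud SQL Proxy later.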
That’s it! If you’d like to check out the written recommendations, connect with your preferred MySQL client.
In this tutorial, you created a db & tables within CloudSQL, trained a model with Spark on Google Cloud’s DataProc service, and wrote predictions back into a CloudSQL db.
Next Steps from here:
- Configure CloudSQL proxy
- Create an API to interact with your recommendation service
Complete code here: