Movie Recommendation System

Janine B
20 min read · Dec 10, 2020

By Janine Bariuan, Kevin Chen, Bibartan Jha, Sudeep Narala, Rooshi Patidar, and Diana Shao

Introduction

For this project, our team set out to create a movie recommendation system that combines multiple datasets to output a list of movies recommended for the user. We found datasets online, including the MovieLens 100K Dataset and the Movie Poster Dataset, which we used for our final model. By combining these datasets with scraped data and implementing two different models, we were able to create a thorough movie recommendation system.

Follow along with our GitHub repo here.

Datasets

There were two existing datasets that we used: the MovieLens 100K Dataset and the Movie Poster Dataset.

MovieLens 100K Dataset

The MovieLens 100K Dataset comprises 100,000 ratings of 1,600+ movies made by 900+ users. The official website can be found here.

We used the items (movies), ratings, and users datasets.

Movie Poster Dataset

This dataset takes all of the movies found in the MovieLens 100K Dataset and provides a URL to the poster image for that movie. The link to the official dataset can be found here.

We used the movie_poster.csv file.

Pre-Processing

MovieLens 100K Dataset

Since the MovieLens dataset consists of thousands of ratings and movies, we first looked for trends in the data so that we could transform the dataset accordingly.

One of the first things we examined was the distribution of the number of ratings per movie.

We found a large number of movies that were only given one rating, which amounts to noise when trying to predict ratings. To adjust for this, we removed from the ratings data all movies with fewer than 15 total ratings from users.

Other steps we took included (a minimal sketch of these steps follows the list):

  • Removing the Unix timestamp from the dataframes, as it was not a factor in making recommendations
  • 0-indexing the data to be more consistent
  • Changing the user_id and movie_id columns to integer types to make the data easier to work with

Fixed training dataframe for the Matrix Factorization model
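
Below is a minimal sketch of these steps using pandas; the file path and exact code are illustrative assumptions, though the column layout follows the MovieLens 100K format.

```python
import pandas as pd

# Load the MovieLens 100K ratings file (tab-separated: user, movie, rating, timestamp)
ratings = pd.read_csv("u.data", sep="\t",
                      names=["user_id", "movie_id", "rating", "timestamp"])

# Drop movies with fewer than 15 total ratings, since single-rating movies are mostly noise
counts = ratings.groupby("movie_id")["rating"].transform("count")
ratings = ratings[counts >= 15]

# Remove the Unix timestamp, which plays no role in the recommendations
ratings = ratings.drop(columns=["timestamp"])

# 0-index the IDs and make sure they are integers
ratings["user_id"] = ratings["user_id"].astype(int) - 1
ratings["movie_id"] = ratings["movie_id"].astype(int) - 1
```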

Movie Poster Dataset

With this dataset, we pre-processed the training set X_train, which encodes the poster images retrieved from the Movie Poster Dataset. For this purpose, we used a pre-processing function within the Keras library, applications.vgg19.preprocess_input(). Since the VGG19 CNN model uses weights pre-trained on the ImageNet dataset, this function aligns the images encoded in X_train with ImageNet by converting them from RGB to BGR and zero-centering each color channel with respect to the ImageNet dataset.
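
As a rough illustration, the pre-processing step could look like the following; the local file paths, image size, and helper name are our assumptions (the dataset itself provides poster URLs, which must be downloaded first).

```python
import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg19 import preprocess_input

def load_and_preprocess(poster_paths, target_size=(224, 224)):
    """Load poster images, stack them into one array, and apply VGG19 pre-processing."""
    arrays = []
    for path in poster_paths:
        img = image.load_img(path, target_size=target_size)  # resize to VGG19's input size
        arrays.append(image.img_to_array(img))                # H x W x 3 float array (RGB)
    X_train = np.stack(arrays)                                # shape: (n_posters, 224, 224, 3)
    # Converts RGB to BGR and zero-centers each channel with respect to ImageNet
    return preprocess_input(X_train)
```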

This pre-processing technique enabled us to better optimize our CNN and KNN models. The main evaluation metric we used was the mean distance from each query to all of its neighbors; the smaller the mean distance, the better the models performed.

The evaluation metric for models will be discussed in more detail in the “Movie Posters Model” section below.

The two graphs below show the “mean distance from neighbors” results of the first 100 images. These results are derived from the same VGG19 model (using the same parameters). The top graph is derived from the model prior to applying pre-processing to the X_train data, and the bottom graph is from after applying pre-processing.

Prior to pre-processing
After pre-processing

As evidenced by the graphs above, the mean distances from neighbors for the first 100 posters are much higher prior to pre-processing than after pre-processing. This confirms that the pre-processing function improved the CNN and KNN models.

Scraping

We decided to increase the amount and scope of our data by scraping overall movie ratings from IMDB and Rotten Tomatoes and factoring those into our model. We scraped all available ratings for the movies listed in the MovieLens 100K dataset.

IMDB

For the IMDB scraping, we used the requests library (a simple HTTP library for Python) to access the IMDB search page and query by the movie title.

We found that the format of the titles in the MovieLens dataset was not suitable for searching, so we removed the year in parentheses to get better results.

Then, once we had the response page in HTML format, we used the BeautifulSoup library to easily parse through the page and find the search results. We then iterated through the results until the result year matched the query’s year, and opened the respective URL using the urlopen method from the urllib library.

Finally, it was a matter of using BeautifulSoup again to find the specific ratings on the IMDB movie page. Since IMDB has two types of ratings, a user rating and a metascore (critic rating), we attempted to scrape both, if available. Some movies weren’t popular enough to garner a metascore, so those were simply represented as NaN. Additionally, some movies couldn’t be found at all, which turned out to be mostly due to mismatched years or movies with alternate names. These were added to the dataset manually by searching IMDB or Google and inputting them by hand into the .csv file.
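
A simplified sketch of this scraping flow is shown below; the search URL and the CSS selectors are illustrative assumptions, since IMDB’s markup changes over time.

```python
import re
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen

def scrape_imdb_ratings(title, year):
    """Return (user rating, metascore) for a MovieLens-style title, or NaN if unavailable."""
    clean_title = re.sub(r"\s*\(\d{4}\)\s*$", "", title)   # drop the "(1995)"-style year suffix

    # Query IMDB's title search page and parse the result rows
    search = requests.get("https://www.imdb.com/find",
                          params={"q": clean_title, "s": "tt"})
    results = BeautifulSoup(search.text, "html.parser").select("td.result_text")

    for result in results:
        if str(year) in result.get_text():                  # match the query's year
            movie_url = "https://www.imdb.com" + result.find("a")["href"]
            page = BeautifulSoup(urlopen(movie_url).read(), "html.parser")
            user_rating = page.select_one("span[itemprop=ratingValue]")
            metascore = page.select_one(".metacriticScore span")   # missing for obscure movies
            return (float(user_rating.text) if user_rating else float("nan"),
                    float(metascore.text) if metascore else float("nan"))
    return float("nan"), float("nan")                       # movie not found at all
```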

Here is a visual of the resulting columns that were created from the IMDB scraping:

Scraped IMDB Data

Rotten Tomatoes

For the Rotten Tomatoes scraping, we followed a similar process as with IMDB. We utilized the requests library and a public API endpoint of Rotten Tomatoes to query their database and find a movie title at a specific year. We encountered similar problems with the MovieLens title format, so, as with IMDB, we removed special characters and extraneous elements from the movie titles.

Rotten Tomatoes’ API allowed us to work with a JSON format of the movie results, so finding the URL for a specific movie was as simple as navigating the JSON. We then were able to use BeautifulSoup at that URL to parse for the Rotten Tomatoes critic and audience scores easily.

However, we ran into the issue of having sent too many requests to the Rotten Tomatoes API, and some of our IPs were subsequently blacklisted from their server, leaving us unable to even visit their web page normally. We ultimately decided that adding two more ratings on top of the IMDB and user ratings we already had would not make enough of a difference to merit using a VPN to regain access and continue scraping. Instead, we decided to focus on other parts of our system.

System Overview

Above is a high-level overview of our recommendation system. It begins with the input query made by the user who wants to receive recommendations. This input is fed into our first model, which makes predicted ratings and recommendations based on movie poster information. The output of the first model is then formatted into a new dataframe that holds the recommended movies and their predicted ratings for the new user. This data, along with the original user input, is used in the matrix factorization model to learn user and movie embeddings. From this model, a final output of the top 5 recommended movies is produced. Each model and the input query are explained in more detail in the following sections.

Input Query

Our recommendation system works for a single user, based on the ratings of movies they have previously watched. This input was obtained using this Google Form, which lists 15 movies and asks the user to rate each one from 1 to 5, with 1 for hating the movie and 5 for loving it.

The movies listed on the Google Form were the 15 most-rated movies in the dataset. We chose these because they appear most frequently in the (user, movie) embedding space, which leads to more accurate predictions.

It also gave us a consistent list for users to rate, and avoided reintroducing the noise we had removed, which could happen if a user entered favorite movies with few ratings in the existing dataset.

We then transformed the input query into a dictionary mapping the movie IDs of those top 15 movies to the user’s ratings.
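
A minimal sketch of this conversion, assuming a hypothetical form-responses CSV with one column per movie title and the MovieLens items file for the title-to-ID lookup:

```python
import pandas as pd

def build_input_query(form_csv, items_csv):
    """Map the Google Form ratings to {movie_id: rating} for the movie posters model."""
    responses = pd.read_csv(form_csv)       # one column per movie title, one row per respondent
    items = pd.read_csv(items_csv)          # MovieLens items file with movie_id and title columns
    title_to_id = dict(zip(items["title"], items["movie_id"]))

    row = responses.iloc[-1]                # take the most recent respondent
    # Keep only the ratings the user filled in, keyed by MovieLens movie ID
    return {title_to_id[title]: int(rating)
            for title, rating in row.items()
            if title in title_to_id and pd.notna(rating)}
```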

This dictionary is used as the input for the movie poster model.

Movie Posters Model

The purpose of the movie posters model is to expand upon the dictionary from the input query by predicting the ratings the user would give to movies similar to the ones already in the input query. This model gauges the similarity between movies based on the similarities between their posters. We operate under the assumption that if Movie A and Movie B have similar posters, and the user has already rated Movie A with a score of 3.0, then the user will most likely have a similar level of excitement or interest in Movie B (and thus Movie B also receives a rating of 3.0).

Model

The initial step we took in creating this model was retrieving the poster images for the 1,600+ movies in the 100K dataset. From these, we created numpy arrays encoding each image. One issue we ran into was that not all of the images could be loaded from the links provided in the dataset.

To sidestep this issue, the model ignores any movie from the user query whose poster could not be loaded. In future work, we could add our own images to the dataset for the movies whose posters failed to load.

In order to find similarities between movie posters, we took an unsupervised transfer-learning approach.

Our first step was to set up a VGG19 CNN. VGG19 is a 19-layer CNN built from 3x3 convolutions, and weights pre-trained on the ImageNet dataset are readily available for it, allowing input images to be classified into different categories. The reason we chose VGG19 is precisely that it is pre-trained on ImageNet; as a result, we did not need to perform any further training to obtain the weights for the network.

Prior to feeding our images into the VGG19 model, we pre-processed the numpy arrays for each poster encoding through the “applications.vgg19.preprocess_input()” function. The purpose of this preprocessing was explained in the section “Pre-processing: Movie Posters Dataset” above.

In order to better adapt the VGG19 model to our images, it was important to remove the final classification layers. This ensures that the model does not become too specific; otherwise, it would attempt to predict the actual objects within the poster images rather than produce general visual features. We remove these layers by setting include_top=False when constructing the model.

Outputting Ratings Predictions and Similar Posters

After generating a list of features by running the pre-trained VGG19 model on the image encodings, a KNN model is used to find the 6 closest neighbors of each movie poster. In other words, for a given poster A, the KNN model uses the features from the CNN to find the 6 posters most similar to poster A.

For our KNN model, we used the NearestNeighbors unsupervised learner from sklearn. NearestNeighbors was particularly helpful because it provides not only the closest neighbors for each query, but also the distances between the query and each of those neighbors.

For example, one of our models produced 6 neighbors for the Toy Story poster. The posters that the model considers closest to the Toy Story poster are as follows:

Furthermore, here are the movie IDs of the neighbor posters:

Finally, here are the KNN-calculated distances between the Toy Story poster and the neighbor posters, as well as the mean of all the distances:

If the movie Toy Story were within the user query dataframe, the neighbor movies shown in the image above would also be added to the dataframe (and they would each be given the same rating as Toy Story).

Optimizing Model and Metrics

However, we need to optimize the VGG19 model so that the KNN will generate the closest possible neighbors to all posters. The metric we will use for optimization is the mean distance between a given poster and all its neighbors (i.e. the mean distance for the Toy Story example above is 2.8401613235473633e-05). By finding the model that gives the lowest mean distances for all posters, we will be able to generate the closest possible neighbors for each poster.

One parameter we tuned for the VGG19 model was the “pooling” parameter, which takes values such as “max”, “avg”, and None and determines how the output of the final convolutional block is pooled into a feature vector.

When pooling was set to none, the mean distances from the neighbors for the first 100 posters were as follows:

When pooling was set to avg, the mean distances from the neighbors for the first 100 posters were as follows:

When pooling was set to max, the mean distances from the neighbors for the first 100 posters were as follows:

As evidenced by the above 3 configurations, the VGG19 model generally gives the lowest distances between a poster and its neighbors when pooling is set to avg.

Ultimately, the VGG19 model and the KNN model were combined into a single feature-extraction and nearest-neighbors pipeline.
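
A minimal sketch of that pipeline, assuming the pre-processed poster encodings X_train from earlier; the exact arguments in our code may have differed.

```python
from tensorflow.keras.applications import VGG19
from sklearn.neighbors import NearestNeighbors

def fit_poster_knn(X_train, n_neighbors=6):
    """Extract VGG19 features from pre-processed posters and fit a KNN model on them."""
    vgg = VGG19(weights="imagenet", include_top=False, pooling="avg")
    features = vgg.predict(X_train)                       # one feature vector per poster
    knn = NearestNeighbors(n_neighbors=n_neighbors + 1)   # +1: a poster is its own nearest neighbor
    knn.fit(features)
    return knn, features

def closest_posters(knn, features, poster_index):
    """Return distances to and indices of the posters most similar to poster_index."""
    distances, indices = knn.kneighbors(features[[poster_index]])
    return distances[0][1:], indices[0][1:]               # drop the query poster itself
```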

Upon calculating the neighbors for each of the movies in the user query, we add those movies to the ratings dataframe, which will be used in the next step, the matrix factorization model. More details on what exactly is added can be found in the example we run through later on in the article.

Matrix Factorization Model

We now use a matrix factorization model to output the final movie recommendations to our user. In order to do this, we add our test user (with the extra data from the movie posters model) to the ratings we got from MovieLens.

Model

Let’s take a look at the actual model. First, we have to frame the problem in a way that the matrix factorization algorithm understands.

We construct a ratings matrix R such that Rij is the rating that user i gave movie j. The matrix contains a lot of 0s, since many (user, movie) pairs have no rating.

Next, we will need to construct the matrices that R will be factorized into. One of the matrices will be the user embedding matrix U and the other is the movie embedding matrix M.

Row i of the user embedding matrix is our representation of user i as a 50-dimensional vector (since we chose an embedding size of 50). We want to construct user and movie embedding matrices such that U*Mᵀ approximates R everywhere R has a non-zero value. This framing is vital to our recommendation system because if 2 users rate movies very similarly, their user embedding vectors are trained to be very similar to each other; the model is incentivized to do this (while also accurately capturing the rating values, of course) in order to reduce the error. This is also why we asked users to rate the 15 most popular movies from the MovieLens dataset: it lets our test user interact with as many other users as possible, which in turn gives the test user’s ratings more influence over the learned embeddings.

Below is a snippet of code for training the matrix factorization model:
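
What follows is a minimal sketch of such a training loop; the library (TensorFlow), the optimizer, and the iteration count are illustrative assumptions rather than our exact code.

```python
import tensorflow as tf

def train_matrix_factorization(R, embedding_size=50, iterations=1000, lr=0.05):
    """Factorize the ratings matrix R (users x movies) so that U @ M.T approximates R."""
    n_users, n_movies = R.shape
    mask = tf.constant((R > 0).astype("float32"))          # fit only the observed (non-zero) ratings
    R = tf.constant(R.astype("float32"))

    U = tf.Variable(tf.random.normal([n_users, embedding_size], stddev=0.1))   # user embeddings
    M = tf.Variable(tf.random.normal([n_movies, embedding_size], stddev=0.1))  # movie embeddings
    optimizer = tf.keras.optimizers.Adam(lr)

    for _ in range(iterations):
        with tf.GradientTape() as tape:
            pred = tf.matmul(U, M, transpose_b=True)       # U @ M.T
            # Mean squared error over the non-zero entries of R only
            loss = tf.reduce_sum(mask * tf.square(R - pred)) / tf.reduce_sum(mask)
        grads = tape.gradient(loss, [U, M])
        optimizer.apply_gradients(zip(grads, [U, M]))
    return U.numpy(), M.numpy()
```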

Metrics

We used mean squared error aggregated over all the non-zero entries of the ratings matrix, applied to the residual R − U*Mᵀ. Here is a simplified form of the MSE loss used for our training:
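
In symbols, with Ω denoting the set of (i, j) pairs that have observed (non-zero) ratings, the loss takes roughly this form:

```latex
\mathrm{MSE}(U, M) = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \left( R_{ij} - (U M^{\top})_{ij} \right)^{2}
```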

Following is the training error graph with respect to the iteration number:

At this point, we use cosine distance to sort users by how similar they are to our test user. We could then make recommendations based on movies that similar users liked.
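
A minimal sketch of this ranking step, assuming U is the trained user embedding matrix and test_index is the row of our test user:

```python
import numpy as np

def rank_users_by_similarity(U, test_index):
    """Return user indices sorted from most to least similar to the test user."""
    test_vec = U[test_index]
    # Cosine similarity = dot product of the vectors divided by the product of their norms
    sims = U @ test_vec / (np.linalg.norm(U, axis=1) * np.linalg.norm(test_vec) + 1e-12)
    order = np.argsort(-sims)                        # descending cosine similarity
    return [i for i in order if i != test_index]     # exclude the test user themselves
```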

Incorporating Scraped Data

However, we wanted to incorporate genre information too, which drove us to use the IMDB ratings we scraped earlier. We chose the IMDB audience score because it is representative of how much the average person liked a particular movie, more so than averaging the MovieLens data, since IMDB is based on more vetted ratings.

First, we need to make sure that the IMDB ratings split our user ratings relatively well. Below is a plot, by genre, of the percentage of times a user rates a movie higher than that movie’s IMDB rating:

This shows that most genres are very close to 50%, so if a particular user consistently rates movies of a certain genre lower than the IMDB audience rating, that is a meaningful signal (i.e. the IMDB ratings aren’t simply inflated or deflated relative to the MovieLens data).

Making Recommendations

Finally, it is time to make the recommendations! In order to make good predictions, we take a 2-fold approach which uses everything we have described earlier:

  1. Sort users by similarity to the test user, and
  2. Pick movies from a genre our user likes that a similar user generally dislikes but still rated highly.

First, we sort all users in descending order of cosine similarity (i.e. from most to least similar to our test user).

Then, in order, we check which of these users dislikes one of the 3 genres that our user likes. We do this by checking if they rate movies in a certain genre lower than the IMDB ratings 70% of the time. We only consider the user for this genre if they have rated more than 25 movies in it in order to discard noise.

Finally, we sample randomly from all the movies in this genre that this user rated a 5.0 and add the sampled movie to our recommended list. We repeat this until we have 10 new movies to recommend to our user.
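
The loop below is a minimal sketch of this procedure; the dataframe layout (a per-genre ratings table), the imdb lookup (assumed rescaled to the 1–5 MovieLens scale), and the helper names are our assumptions, while the thresholds mirror the description above.

```python
import random

def recommend(sorted_users, ratings, imdb, favorite_genres, n_recs=10):
    """Pick movies that a similar, genre-disliking user nevertheless rated 5.0."""
    recommendations = []
    for user in sorted_users:                                    # most similar users first
        for genre in favorite_genres:                            # the test user's top 3 genres
            user_genre = ratings[(ratings.user_id == user) & (ratings.genre == genre)]
            if len(user_genre) <= 25:                            # too few ratings in this genre: skip as noise
                continue
            # Fraction of this user's ratings in the genre that fall below the IMDB rating
            below_imdb = (user_genre.rating < user_genre.movie_id.map(imdb)).mean()
            if below_imdb >= 0.70:                               # the user generally dislikes the genre
                loved = user_genre[user_genre.rating == 5.0].movie_id.tolist()
                new = [m for m in loved if m not in recommendations]
                if new:
                    recommendations.append(random.choice(new))   # sample one of their 5.0-rated movies
        if len(recommendations) >= n_recs:
            break
    return recommendations[:n_recs]
```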

Voila! Now the user can enjoy movies to their liking. Filling out a Google Form to get these recommendations is much quicker than trying out all 1,682 movies yourself!

Running Through An Example

We will now demonstrate a specific example of our entire recommendation system.

Input Query

We collect user ratings on the top 15 movies from the dataset and export the results into a .csv:

Converting this information into a dataframe, we get:

As you can see, this input user prefers movies for a younger audience, which is reflected in their top 3 favorite genres: animation, children’s, and musicals. From here, we process the information and send it to the movie posters model.

Movie Posters Model

This is how the ratings dataframe looks when inputted into the model:

The model then goes on to find the closest neighbors for each of these 14 movies, and adds those movies to the rating dataframe.

Here are the closest neighbors found for some of the movies above:

Poster recommendations for ‘Toy Story’
Poster recommendations for ‘12 Monkeys’
Poster recommendations for ‘Fargo’
Poster recommendations for ‘Independence Day’
Poster recommendations for ‘The English Patient’

The Godfather was not included because its poster could not be retrieved from the posters database.

The neighbors for all of the posters will be added to the dataframe with the same ratings. So, the neighbors for Toy Story will all be added with a rating of ‘5’, the neighbors for Twelve Monkeys will all be added with a rating of ‘2’, etc.

This is how a portion of the ratings dataframe looked after running this model:

As we can see, the movie posters model greatly increases the amount of information fed into the matrix factorization model, allowing the final output to be more accurate for the user. This new ratings dataframe is now sent as input to the matrix factorization model.

Matrix Factorization Model

Now, we are ready to look at the final recommendation results.

We discussed earlier that one technique is to simply sort users by similarity to our test user and pick their favorite movies within our user’s preferred genres. Using this technique yields the following movie recommendations for our test user:

Another technique, outlined earlier, is to sort users by similarity to our test user but only pick a movie from a similar user if that user tends to dislike one of the genres our user likes. This is what we used as our final model. Using this technique yielded the following alternative recommendations:

We think the latter method is a better, more robust way to make predictions. However, we unfortunately can’t know for sure whether this technique more accurately captures the user’s tastes, because we don’t have the user’s ratings for the movies we recommend.

In the end, the combination of our two models produced satisfactory recommendations, as they align with the original test user’s preference for children’s, animation, and musical movies.

Ideas That Didn’t Work Out

There were a few ideas we tested early on that, unfortunately, did not achieve satisfactory results.

XGBoost Classifier

Since the matrix factorization model ignores the demographic information associated with each user, we tried to create an XGBoost classifier that uses this data after encoding it categorically. We incorporated the data by building a dynamic XGBoost model on top of the recommendations made by the matrix factorization model.

For each movie in our final output, an XGBoost model was trained on data created from all the users who rated that movie. The target variable was 0 if the user disliked the movie and 1 if they liked it, where a rating of 4 or higher represented liking it.

The idea was that if the matrix factorization model recommended 5 movies, we would create and train an XGBoost classifier for each of them. We would then check the soft probability output by each of the 5 models when given our test user’s demographic information as input, representing the chance that someone with the test user’s demographic background would like that movie.
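
A minimal sketch of one such per-movie classifier; the demographic columns follow the MovieLens users file, but the feature encoding and hyperparameters are our assumptions.

```python
import pandas as pd
from xgboost import XGBClassifier

def like_probability(movie_id, ratings, users, test_user_demographics):
    """Train on the demographics of everyone who rated this movie; return P(like) for the test user."""
    rated = ratings[ratings.movie_id == movie_id].merge(users, on="user_id")
    X = pd.get_dummies(rated[["age", "gender", "occupation"]])   # encode the demographic features
    y = (rated.rating >= 4).astype(int)                          # 1 = liked (rating of 4 or 5)

    model = XGBClassifier(n_estimators=100, max_depth=3)
    model.fit(X, y)

    # Encode the test user the same way and align the one-hot columns with the training data
    test_X = pd.get_dummies(pd.DataFrame([test_user_demographics])).reindex(
        columns=X.columns, fill_value=0)
    return model.predict_proba(test_X)[0, 1]                     # soft probability of "like"
```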

This idea didn’t pan out since either:

  1. There was either not enough data for each individual movie in our dataset to allow for a good model or
  2. Demographic data doesn’t accurately predict a user’s taste for movies. This led to us getting model outputs that were not much better than random guessing (in some cases it was worse, with test data AUC below 0.5).

Following is a plot from an attempt to run the XGBoost classifier on the most popular movie in the MovieLens dataset (Toy Story). It shows an AUC very close to that of random guessing, so it doesn’t make much sense to use the model:

Softmax Neural Network

For the second model, which is now handled by matrix factorization, we originally attempted to implement a neural network with softmax as the final activation layer, and we spent much of our time on it at the beginning. We pursued it because it allowed us to generate recommendations based on more personal factors, such as occupation, age, and other information we wanted to explore.

This model tried to predict ratings for the movies the input user had not yet rated. The input to the model is a user and a movie, and the model tries to predict the rating that user would give the movie. To use softmax, we turned this into a multiclass classification problem by one-hot encoding the ratings into five classes, one for each possible rating: 1, 2, 3, 4, 5.

The model would then output a vector of 5 numbers, each element corresponding to the ratings classes in order. Below is an example for a predicted rating for user 942 for a random movie input. Here, the model predicted that user 942 would have most likely rated this movie a 5.

To represent each user and movie in the embedding layers, we set the number of factors to 50. We used Keras to build our neural network with the following layers (a sketch of the full model follows the list):

  • Input(shape = 1), taking in the number of user inputs, which is 1
  • Input(shape = 1), taking in the number of movie inputs, which is 1
  • Embedding(input_dim = number_of_users), creating an embedding layer for all of the users found in the MovieLens dataset and the input user
  • Embedding(input_dim = number_of_movies), creating an embedding layer for all of the movies found in the MovieLens dataset
  • Two Reshape() layers to flatten each embedding into a 50-dimensional vector
  • Two Dropout(0.5) layers, one after each embedding, to prevent overfitting
  • Concatenate(user_embedding, movie_embedding) to merge the two embedding layers
  • Dropout(0.5) again to prevent overfitting
  • Dense(activation = relu)
  • Dropout(0.5) again to prevent overfitting
  • Dense(activation = softmax), to make the final output layer
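
Put together, a sketch of this network looks roughly like the following; the hidden Dense width and any compile settings beyond categorical cross-entropy and accuracy are our assumptions.

```python
from tensorflow.keras import layers, Model

def build_softmax_model(number_of_users, number_of_movies, n_factors=50):
    user_in = layers.Input(shape=(1,))       # a single user ID per example
    movie_in = layers.Input(shape=(1,))      # a single movie ID per example

    u = layers.Embedding(input_dim=number_of_users, output_dim=n_factors)(user_in)
    u = layers.Reshape((n_factors,))(u)      # flatten to a 50-dimensional vector
    u = layers.Dropout(0.5)(u)

    m = layers.Embedding(input_dim=number_of_movies, output_dim=n_factors)(movie_in)
    m = layers.Reshape((n_factors,))(m)
    m = layers.Dropout(0.5)(m)

    x = layers.Concatenate()([u, m])         # merge the two embeddings
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(5, activation="softmax")(x)   # one class per rating 1-5

    model = Model([user_in, movie_in], out)
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
```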

This model used categorical cross entropy as the train and validation loss, and used accuracy as the metric. After fitting, we obtained the following results:

There were some promising results from this model. As you can see, the training and validation loss did have a downward trend:

And the accuracy improved over each epoch:

However, the loss remained high, and we were never able to increase the accuracy past 0.45. The actual predictions made with the model were far off: we would create a user query that favored children’s movies, but it would output dramas as the predicted highest-rated movies. We tried other layers, such as a Dot() layer instead of Concatenate(), and sigmoid as the final activation instead of softmax, but these attempts, along with hyperparameter tuning, were to no avail. Therefore, we decided to focus on matrix factorization.

VGG19 (Movie Posters) Model Layers

For the VGG19 model, we tried removing different layers from the network. To do so, we implemented a Sequential() model and, after loading the VGG19 model with all of its layers, added only selected layers to the Sequential() model. This was an attempt to see whether removing layers would make the model more accurate at finding the closest neighbors. Unfortunately, it did not consistently improve the distances between queries and their neighbors, so the idea of removing layers from the CNN was scrapped.

Conclusion

Overall, the final model combining VGG19 and matrix factorization proved successful. We were able to take very different kinds of information about movies, such as their posters and genres, and combine them with ratings from previously watched movies and other users to create a proper recommendation system. We hope you enjoyed reading about our project!

This project was done as a part of our data science class final project at The University of Texas at Austin.
