Recommendation systems in Python
As a machine learning learner one important topic to learn is producing recommendations. The Netflix prize is cited everywhere but unfortunately learning resources are simple examples or very complex research papers. Here is my attempt to curate some resources to understand various types of recommendations and keeping a new learner in mind. I will use some libraries to predict and then evaluate outcomes.
Generally, the goal of recommendations is to show users some products, news stories, services, or anything new to discover that they may be interested in or find engaging (these are known as items). Here are some typical cases.
- Ranking a list of top movies at IMDB, top apps in the app store, best restaurants, top travel destinations, trending videos or tweets etc. In this approach some metric(s) associated to the items are used to determine order between items and a ranking. Some metric examples: rating of movies, $ gross of movies or apps etc. Such recommendations can be constructed easily, and every user sees the same list. A benefit is that such recommendations are always available so the “cold start” problem is not present. On the downside, maybe the viewer does not care for the results or for example, they have already watched the top 5 movies so nothing new to discover.
- Because a user watched a particular video on YouTube suggest other similar videos to them, or movies based on certain genres they prefer, or restaurants because they ordered at a particular one, etc. These are content based recommendations and a user’s preference is used to influence the recommendation. The user’s interests maybe explicitly asked for or implicitly captured based on some previous activity. It is reasonable to assume that recommendations would be more personalized but on the downside, users may not get to discover all points of view on any particular topic and cold start also needs to be considered. So first we need to understand what a user likes and then finds items similar to that.
- On Facebook or LinkedIn a user may get recommendations for certain events because their friend liked it or are going to that event. This is a user-based collaborative recommendation where similarity between users influences items to recommend. There is actually no relationship necessary between the users in question — they don’t have to be “friends” — its just that certain group of similar users prefer similar items.
- On Amazon or other ecommerce sites you may see recommendations for additional items to purchase when you add an item to your cart. The recommended item themselves may seem unrelated however other users who purchased the item you want have purchased the recommended items leading to the item similarity. Matches on Tinder or other dating sites are another example. This is item-based collaborative recommendation and the item content itself is not used. Tinder provides good intuition. For whatever unknown human reasons users that like a certain match can be shown other profiles where other users actions were similar. The features of the profile itself are not quantifiable.
There are hybrid approaches as well and as usual Wikipedia has a page for reference but it gets into too many details too quickly. Here is a simpler page to follow.
Let us try these methods in Python. Movie Lens provides the standard example used by everyone and makes it easy to understand and explain. I will use a subset of their data. The first is the
movies_metadata.csv which is basically the movie title, description, genres, overall rating etc, and the second is
ratings_small.csv which contains the rating given by users to the movies. Look at Kaggle for each file summary.
I have mentioned “similarity” multiple times so far and intuitively we understand what that means in a particular context like similar movies or similar users but what does it mean to a computer program?
The relational operators (equals, less than, greater than etc) are well understood and indeed indicate “similarity” of objects being compared. In context of machine learning problems though we are generally dealing with large multi-dimensional matrices and there are multiple similarity measures that are available. These include euclidean distance (think straight line between two points in a 2-D space), cosine similarity (think comparing relative angle between two vectors rooted at the origin and discounting the magnitude), Jaccard similarity (compare two sets of objects), Hamming distance (binary string compare) and many others (excellent reference with Python examples). We will use some of these later.
Let us now try to answer each of the 4 problems listed above.
My notebook is available in Github if you want to open it and follow along.
The first part of this Kaggle solution (“simple recommender”) provides an excellent way. The dataset contains the
vote_average and give a count threshold include in a simple weighted score. Link to my notebook. The top 5 movies found here certainly have name recognition and generally likely already watched by most.
It should be easy to see that this method could be influenced by “fake” ratings or hard to spot data irregularities so something to keep in mind.
Content Based Recommendation
The simplest recommendation is “because you watched Terminator you may like these movies.” This example is available all over the internet and the only thing from a user profile (if we can call it that) is that I just watched Terminator. This involves applying TF-IDF to some movie features and the cosine distance to measure similarity. Its not very intuitive but basically you can think of each row in the tf-idf matrix (the vector of floating numbers) to be the representation of whatever features you had.
The second part of the same Kaggle solution linked above is an easy to understand example. Frankly this is more to learn about similarity but its pretty good outcome based on some simple tests. Link to my notebook.
Now let us consider a more comprehensive approach. We need to build a user profile and in our case the ratings users have given to movies is the starting point (the file
ratings_small.csv). The ratings from 0 (hated it) to 5 (loved it) indicate this. This could have also been collected implicitly. For example, I watched the full YouTube video vs after 5mins I switched to something else and did not complete watching even after few days. Additionally, we also need to establish similarity between the movies (which we already did in the first attempt).
Based on ratings a user has given to movies, they are used to create a weighted average using the movies vector representation. Then a cosine distance is computed between this profile and each movie. Refer to my notebook for the code and here is example outcome for recommendations for userId 5.
Are these good recommendations? So far, we just looked at the results intuitively and said they were good but now it is not so simple. Unless these were recommendations for me, I will be unable to say how good they are. So how do we objectively measure if our recommendations are good? I will discuss this in more detail later.
Here is a sample of various movies and ratings given by users. Note that there are thousands of movies and users and typically users only rate a few movies so most values will be
NaN. This is therefore known as a “sparse” matrix.
Our goal is essentially to predict values for these missing ones. We can then recommend the ones that come up with a high rating in our predictions.
As discussed previously there are user based and item based approaches to filling these blanks. I’m using the Surprise library and it follows the Scikit conventions and after you read their docs it is nice and easy to follow. Steps to follow: pick the algorithm,
fit the training data, then
test the model to generate predictions for test data. We will discuss “accuracy” of our recommendations later so for now the training data is all the available ratings and the test data is all the
NaN values as seen above — since we want to fill this whole matrix. Refer to my notebook for the code but in summary
build_full_trainset() gives the training set and
build_anti_testset() gives the missing ones. Train and test takes quite a bit of time so see notebook for ways to reduce size by excluding movies rated by very few users and users who have rated very few movies as shown here (this problem is general and known as “long tail”).
There are two families of algorithms — memory based and model based that can help solve our problem (excellent descriptions and details).
Memory based algorithms use statistical methods and k-Nearest Neighbors (kNN) is what we will use. Intuitively you can think of finding clusters of nearest neighbors (users or items) based on cosine similarity and then predicting the rating based on some formula like mean or weighted mean. Here is the outcome for item based and user based kNN clustering for the same userId 5 as above. Note the ‘not found’ are movies rated by users but no in the movies files for some reason.
Model based algorithms perform “dimensionality reduction” of the sparse matrix with various approaches which are beyond the scope of my discussion. SVD is the method we use. Surprise provides many options and we look at more later. Here is the SVD outcome, again for userId 5.
In all three cases above looks like a good overlap so we must be doing something right? Are these good recommendations? Again, it is hard to objectively determine as we noted above.
As with any prediction model we need some way to explain and measure the recommendations provided. Understanding this in detail is beyond our scope but similar to an adjusted R-squared value for regression or AUC for a classification problem, what can we use here?
One intuitive way we can think of using is to compare the test set (where we already observed how a user rated an item) with the predicted values for these test users based on the model. So suppose user1 rated movie1 as 4.5 (noted as
y) but the model predicts 4.2 (noted as ŷ (y-hat). This page from Dataquest describes many metrics well that compare in this way:
- MSE — Mean squared error and RSME — Root mean squared error
- MAE — Mean absolute error and MAPE — Mean absolute percentage error
Using what Surprise provides, for the above recommendations the values look very similar. The next question, what is better? RMSE and MAE are in terms of the ratings scale and the smallest value is best. For SVD predicted values were closest to observed values! (Link to my notebook)
This can also be then run easily on all the algorithms Surprise provides.
There are additional methods to compare as well. For example:
- Hit Rate — compute the topN items per user and see if any match user’s rating (counting such as a hit). There are other related metrics to hit rate as well. (Reference)
- Precision and Recall @N— this is an adaptation of the confusion matrix from classification problems based on relevance of recommendations. (Reference)
One thing to note is that all these methods are based on historical data so unless you actually do an A/B test and determine the user reaction it is unknown if the recommendations are actually good!
If these topics interest you then reach out to me, and I will appreciate any feedback. If you would like to work on such problems you will generally find open roles as well! Please refer to LinkedIn.