Movie Recommendation System

Mar 1, 2021

This project was built by the Neon team: Aisha Hakami, Monirah Bin Taleb, Mohammed Al-Ali, and Lama Alharbi.

In daily life, when we shop online or look for a movie to watch, we usually ask our friends or search for it ourselves. Sometimes friends recommend something they enjoyed but we don’t like, and that is a waste of time. What if there were a system that could understand you and recommend things based on your interests? That is exactly what recommender systems are made for.

On streaming websites, understanding user behavior is always a challenge. Regardless of gender, age, or geographical location, everyone enjoys watching movies at home. Some people prefer specific genres such as romance, action, or comedy, while others follow lead actors or directors. We are all connected through this medium, yet what is most interesting is how distinctive our choices and combinations of preferences are. Taken together, this makes it remarkably difficult to generalize about a movie and say that everyone will like it. This is where recommender systems shine, acting as personal assistants for each user’s needs.

We found out that STC launched an Open Data initiative to support people interested in data science. The data comes from Jawwy, their IPTV service, and we decided that it would be the start of the Neon recommendation system journey. The data records the behavior of 29,487 users across about 8k shows, in roughly 3 million rows. We extracted behavioral patterns not only from the users but also from the movies themselves.

Modeling

We decided to use 6 models to match user preferences as closely as we can. Those models are:

Collaborative filters

This type of filter is based on users’ watching history. It recommends movies that we haven’t watched yet but that users similar to us have watched and liked. To determine whether two users are similar, the filter considers the movies both of them watched. By looking at the movies in common, the algorithm recommends a movie to a user who hasn’t watched it yet, based on similar users’ watching history.
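The idea can be illustrated with a minimal sketch. It is a simplified stand-in, not the project's implementation: it measures similarity by the number of shared movies instead of cosine similarity, and the function name, user ids, and movie titles are all made up for the example.

```python
def recommend_from_similar_users(target, history, top_n=3):
    """Recommend unseen movies for `target` from users with overlapping history.

    `history` maps a user id to the set of movies that user watched.
    """
    seen = history[target]
    scores = {}
    for user, movies in history.items():
        if user == target:
            continue
        overlap = len(seen & movies)  # movies both users watched
        if overlap == 0:
            continue  # no shared movies, so no evidence of similar taste
        for movie in movies - seen:
            # each unseen movie is weighted by how similar its watcher is
            scores[movie] = scores.get(movie, 0) + overlap
    # highest score first; ties broken alphabetically for a stable order
    ranked = sorted(scores, key=lambda m: (-scores[m], m))
    return ranked[:top_n]


# hypothetical watch histories
history = {
    "user_a": {"Movie 1", "Movie 2", "Movie 3"},
    "user_b": {"Movie 1", "Movie 2", "Movie 4"},
    "user_c": {"Movie 3", "Movie 5"},
    "user_d": {"Movie 6"},
}
print(recommend_from_similar_users("user_a", history))
```

Here user_b shares two movies with user_a, so user_b's unseen movie ranks ahead of user_c's, and user_d contributes nothing because there is no overlap.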

Pros:

Collaborative filtering relies on the judgments of people in the system, and people are expected to be better at evaluating information than a computed function.

Cons:

Cold start: a major challenge of collaborative filtering is how to make recommendations for a new user who has only recently entered the system. This is called the cold-start user problem.

Popularity bias:

Cannot recommend items to someone with a unique taste.
Tends to recommend popular items.

Content-based filters

Based on what we like, the algorithm simply picks a movie with similar content (story description) to recommend to us. This type of filter does not involve other users.

Pros :

No cold-start problem for new items: unlike collaborative filtering, if the programs have sufficient descriptions, we avoid the “new item problem”.
Able to recommend to users with unique tastes

Cons :

Content-based filters tend toward over-specialization: they recommend items similar to those already consumed, which tends to create a “filter bubble”.
Never recommends items outside the user’s content profile, even though people might have multiple interests.

Popularity-Based Recommendation System

This type of recommendation system works on the principle of popularity, or whatever is currently trending. It checks which movies are trending or most popular among users and recommends those directly.
For example, if most users often watch a program, the system learns that the program is popular and recommends it to every new user.
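A popularity ranking of this kind can be sketched in a few lines. This is only an illustration under assumptions: the watch log format, the function name, and the choice to count each user once per program are ours, not taken from the project.

```python
from collections import Counter


def most_popular(watch_log, top_n=2):
    """Return the `top_n` programs watched by the largest number of distinct users."""
    # de-duplicate (user, program) pairs so rewatching does not inflate popularity
    unique_views = set(watch_log)
    counts = Counter(program for _, program in unique_views)
    # most watched first; ties broken alphabetically for a stable order
    ranked = sorted(counts, key=lambda p: (-counts[p], p))
    return ranked[:top_n]


# hypothetical (user, program) watch events
watch_log = [
    ("user_a", "Program X"), ("user_b", "Program X"), ("user_c", "Program X"),
    ("user_a", "Program Y"), ("user_b", "Program Y"),
    ("user_a", "Program Z"), ("user_a", "Program Z"),
]
print(most_popular(watch_log))
```

Every new user would get the same list, which is why this model has no cold-start problem but also no personalization.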

Hybrid filtering :

Create a weighted recommender (equal weights, combining the results of predict_cf and predict_cn).
Create a second weighted recommender (near-equal weights, combining the results of predict_cf, predict_cn, and predict_popularity).
Create a popularity-scaled recommender (the combined CL and CB prediction is multiplied by the popularity prediction).
Keeps the strengths of the CL (collaborative), CB (content-based), and popularity models.
Overcomes the cons of the CL, CB, and popularity models.

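import pandas as pd

# Series mapping each row position to its program title (the index of df_program_desc)
indices = pd.Series(df_program_desc.index)

# defining the function that takes a program title and a recommendation type
# as input and returns the top 10 recommended movies
# (cosine_sim, cosine_sim_w, and prec_watch_mat must already be defined)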
def recommendations(title,type_of_recommendation,  cosine_sim = cosine_sim, cosine_sim_w = cosine_sim_w, prec_watch_mat = prec_watch_mat):
    '''
    type_of_recommendation values:
        
        0: the similarity scores of program description (Content Based),
        1: the similarity scores of watch history (Collaborative filtering),
        2: the similarity scores between watch history and program description (Hybrid),
        3: the similarity scores between watch history and program description 
           with popularity of program as independent variables (Hybrid),
        4: the similarity scores between watch history and program description 
           as dependent variables on popularity of program (Hybrid).
    '''
    
    
    # initializing the empty list of recommended movies
    recommended_movies = []
    
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]
    
    if type_of_recommendation == 0:
        
        # creating a Series with the similarity scores of program description in descending order
        score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    
    elif  type_of_recommendation == 1:
        
        # creating a Series with the similarity scores of watch history in descending order
        score_series = pd.Series(cosine_sim_w[idx]).sort_values(ascending = False)
    
    elif  type_of_recommendation == 2:
        
        # creating a Series with the similarity scores between watch history and program description in descending order
        score_series = pd.Series(cosine_sim_w[idx]*0.5 + cosine_sim[idx]*0.5).sort_values(ascending = False)

    elif  type_of_recommendation == 3:
        
        # creating a Series with the similarity scores between watch history and program description 
        # with popularity of program as independent variables in descending order
        score_series = pd.Series(cosine_sim_w[idx]*0.33 + cosine_sim[idx]*0.33 + prec_watch_mat*0.34).sort_values(ascending = False)

    elif  type_of_recommendation == 4:
        
        # creating a Series with the similarity scores between watch history and program description 
        # as dependent variables on popularity of program in descending order
        score_series = pd.Series((cosine_sim_w[idx]*0.5 + cosine_sim[idx]*0.5) * prec_watch_mat).sort_values(ascending = False)
    
    else:
        print('You have entered wrong value')
        return

    
    # getting the indexes of the 10 best-scoring movies, excluding the queried title itself
    # (once popularity is mixed in, the queried title is not guaranteed to rank first)
    top_10_indexes = list(score_series.drop(idx).iloc[:10].index)
    
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(df_program_desc.index[i])
        
    return recommended_movies
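To see how the three hybrid options (type_of_recommendation values 2, 3, and 4 above) differ, here is a small standalone sketch with made-up scores for two programs. The weights are the ones used in the function above; the `blend` and `best` helpers, the mode names, and the numbers are hypothetical and only serve the illustration.

```python
def blend(cf, cb, popularity=None, mode="equal"):
    """Combine per-program scores; all inputs are equal-length lists."""
    if mode == "equal":
        # option 2: collaborative and content-based scores, equally weighted
        return [0.5 * a + 0.5 * b for a, b in zip(cf, cb)]
    if mode not in ("additive", "scaled"):
        raise ValueError(f"unknown mode: {mode}")
    if popularity is None:
        raise ValueError("this mode needs popularity scores")
    if mode == "additive":
        # option 3: popularity enters as a third, independent term
        return [0.33 * a + 0.33 * b + 0.34 * p
                for a, b, p in zip(cf, cb, popularity)]
    # option 4: the blended similarity is scaled by popularity
    return [(0.5 * a + 0.5 * b) * p for a, b, p in zip(cf, cb, popularity)]


def best(scores):
    """Position of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# made-up scores for two programs
cf_scores = [0.2, 0.8]    # collaborative similarity
cb_scores = [0.6, 0.4]    # content-based similarity
popularity = [1.0, 0.1]   # share of users who watched each program

for mode in ("equal", "additive", "scaled"):
    scores = blend(cf_scores, cb_scores, popularity, mode)
    print(mode, [round(s, 3) for s in scores], "best:", best(scores))
```

With these numbers the second program wins on similarity alone, while the first program wins as soon as popularity is included, either as an extra term or as a multiplier.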

How does Cosine Similarity work?

All the filters above use cosine similarity, a metric that measures how similar two documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (due to the size of the documents), they may still be oriented close together. The smaller the angle, the higher the cosine similarity.

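from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# instantiating and generating the count matrix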
count = CountVectorizer()
count_matrix_desc = count.fit_transform(df_program_desc['cleaned_desc'])

# generating the cosine similarity matrix on program description
cosine_sim = cosine_similarity(count_matrix_desc, count_matrix_desc)

# generating the cosine similarity matrix on watch history
cosine_sim_w = cosine_similarity(watch_crosstab_transpose.values, watch_crosstab_transpose.values)
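As a worked example of why document size does not matter, here is a small pure-Python sketch of the same measure. The vectors are made-up word counts and the `cosine` helper is ours; the project itself uses scikit-learn's cosine_similarity as shown above.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)


# made-up word-count vectors over a three-word vocabulary
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same word proportions, ten times longer
other_doc = [0, 0, 5]    # shares no words with the other two

print(cosine(short_doc, long_doc))      # very close to 1: same direction
print(cosine(short_doc, other_doc))     # 0.0: nothing in common
print(math.dist(short_doc, long_doc))   # Euclidean distance is still large
```

The short and long documents point in the same direction, so their cosine similarity is essentially 1 even though their Euclidean distance is about 20.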

Surprise

We also used Surprise, a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

Surprise was designed with the following purposes in mind:

Give users perfect control over their experiments.
Provide various ready-to-use prediction algorithms such as baseline algorithms, neighborhood methods.
Make it easy to implement new algorithm ideas.

Data engineering:

To apply all the approaches above, we needed to add some features that the original data lacks.

So we used an external dataset to get the description and total duration of each movie. The column with movie names needed cleaning: everything had to be lowercase to avoid duplicates when merging the tables. We also cleaned the descriptions.

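import re

# the stop word list must be downloaded once with nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# Function for removing NonAscii characters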
def _removeNonAscii(s):
    return "".join(i for i in s if  ord(i)<128)

# Function for converting into lower case
def make_lower_case(text):
    return text.lower()

# Function for removing stop words
def remove_stop_words(text):
    text = text.split()
    stops = set(stopwords.words("english"))
    text = [w for w in text if w not in stops]
    text = " ".join(text)
    return text

# Function for removing punctuation
def remove_punctuation(text):
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)
    text = " ".join(text)
    return text

# Function for removing the html tags
def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

# Applying all the functions to description and storing the result as cleaned_desc
# (HTML tags are removed before punctuation; otherwise the angle brackets
# are stripped first and the tag names stay in the text)
newdf['cleaned_desc'] = newdf['description'].apply(_removeNonAscii)
newdf['cleaned_desc'] = newdf.cleaned_desc.apply(func = make_lower_case)
newdf['cleaned_desc'] = newdf.cleaned_desc.apply(func=remove_html)
newdf['cleaned_desc'] = newdf.cleaned_desc.apply(func = remove_stop_words)
newdf['cleaned_desc'] = newdf.cleaned_desc.apply(func=remove_punctuation)

In addition, the Surprise library needs ratings from users. Since our data has none, we built a substitute: if a user watched more than half of a movie, we assume they liked it, otherwise not. We extracted the minutes with a regex, converted them to the appropriate type, and then converted minutes to seconds. We then calculated the watching average to use as the user rating.

# Strip the 'min' suffix from the duration text and convert to the appropriate numeric type
merged_df_movie['total_duration'] = merged_df_movie['duration'].str.replace('min', '', regex=False).str.strip()
merged_df_movie['duration_seconds'] = pd.to_numeric((merged_df_movie['duration_seconds']) , errors='coerce').astype('Int64')
merged_df_movie['total_duration'] = pd.to_numeric((merged_df_movie['total_duration']) , errors='coerce').astype('Int64')

# convert from min to sec
merged_df_movie['total_duration']=(merged_df_movie['total_duration']*60)
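Once both durations are in seconds, the implicit rating can be derived from the watched fraction. The sketch below is an assumption about how that step could look, not the project's code: the helper names are hypothetical, and the 0.5 threshold follows the "more than half" rule described above.

```python
def watch_ratio(watched_seconds, total_seconds):
    """Fraction of a program that was watched, capped at 1.0."""
    if not total_seconds or total_seconds <= 0:
        return None  # unknown or invalid duration, so no rating can be derived
    return min(watched_seconds / total_seconds, 1.0)


def liked(watched_seconds, total_seconds, threshold=0.5):
    """True when the user watched more than `threshold` of the program."""
    ratio = watch_ratio(watched_seconds, total_seconds)
    return ratio is not None and ratio > threshold


# a 90-minute movie is 5400 seconds long
print(watch_ratio(2700, 5400))  # exactly half was watched
print(liked(2700, 5400))        # half is not "more than half"
print(liked(4000, 5400))        # about 74% watched
```

The ratio itself can serve as the rating passed to Surprise, while the boolean gives the simpler liked / not liked signal.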

Final data set :

We used hit rate as the evaluation metric for the cosine-matrix algorithms.

The hit rate is computed by generating the top-N recommendations for a user and comparing them with the movies that user watched. Each match increases the hit count by 1, and repeating this over the complete dataset gives the total. We sum the number of hits for each movie in our top-N list and divide by the total number of movies. Since we want to recommend movies that are new to the user, the closer to 0 the better.

The content-based model gets a good score, while the collaborative model does not do as well. The hybrid model keeps the strengths of the previous models and gets a good score. The hybrid model with popularity is somewhat biased toward user preferences and gets a reasonable score as well.
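The calculation can be sketched as follows. This is one possible reading of the description above rather than the project's exact code: we assume "total number of movies" means the total number of recommended titles, and the function name and sample data are made up.

```python
def hit_rate(recommendations, watched):
    """Share of recommended titles that the user has already watched.

    `recommendations` maps a user id to that user's top-N list;
    `watched` maps a user id to the set of titles the user watched.
    """
    hits = 0
    total = 0
    for user, top_n in recommendations.items():
        seen = watched.get(user, set())
        hits += sum(1 for title in top_n if title in seen)
        total += len(top_n)
    return hits / total if total else 0.0


# hypothetical top-3 lists and watch histories
recommendations = {
    "user_a": ["Movie 1", "Movie 4", "Movie 5"],
    "user_b": ["Movie 2", "Movie 6", "Movie 7"],
}
watched = {
    "user_a": {"Movie 1", "Movie 2"},
    "user_b": {"Movie 3"},
}
print(hit_rate(recommendations, watched))
```

Only one of the six recommended titles was already watched, so the score is 1/6; a model that recommended nothing but new titles would score 0.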

For Surprise

We use RMSE to evaluate the Surprise models with several algorithms, such as SVD++ (SVDpp in Surprise), an extension of SVD that takes implicit ratings into account.
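For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. Surprise computes it for you; the standalone sketch below only shows the arithmetic, with made-up ratings on the 0 to 1 watch-ratio scale.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# made-up ratings: two predictions are off by 0.5, two are exact
actual = [1.0, 0.5, 0.0, 1.0]
predicted = [0.5, 0.5, 0.5, 1.0]
print(rmse(actual, predicted))
```

The squared errors are 0.25, 0, 0.25, and 0, so the mean is 0.125 and the RMSE is about 0.354. Lower values mean the predicted ratings are closer to the observed ones.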

We developed a demo website built with three technologies. Flask serves the Python models and sends the results to Ajax, which calls an API to fetch movie posters and details. Finally, everything is rendered in HTML to show it in this elegant style :)

You can try it from here

Thank you for reading. I hope you like our work!

Thanks to my teammates for working so hard on this, and to our supervisor, Mukesh Mithrakumar.