Automatic Star Rating for Movie Reviews - Part 1
One of my favorite hobbies in the world is watching movies. Usually, if I go a week without watching a new film, I feel that something is wrong. After watching a movie, I read reviews from different critics on the Internet, either to see other opinions about the movie I just saw or simply to learn something new about movies in general. Most of the critics that I read rate movies using the “star” system, where a film gets a rating ranging from one to five stars.
However, this rating system has some problems. I have seen some critics complaining that the text they wrote does not reflect the grade the movie received. The following tweet is a perfect example of that problem:
"Você deu 3 estrelas pra La La Land e 5 pro (filme do qual não gostei)".
1) Não "dei estrelas". Escrevi 1600 palavras
2) ODEIO cotações.
Pablo Villaça (@pablovillaca), January 18, 2017
The tweet can be translated to:
"You gave 3 stars to La La Land and 5 to (movie which I didn't like)". 1) I did not *give stars*. I wrote 1600 words. 2) I HATE ratings.
It is easy to understand why some critics may find this task difficult. Not only is it hard to compress a whole review into a single grade, but some readers also demand consistency between grades, as can be seen in the tweet above.
With that problem in hand, I have started a project to see if I could generate this kind of rating from the text of a film review using Machine Learning techniques.
However, to use Machine Learning, we need data. For this problem, the data were obtained from three different websites that publish movie reviews with this type of rating (all of the websites are in Brazilian Portuguese).
After crawling these websites for reviews, I reached this amount of data:
This is a really small dataset. But there is also a second issue with it, as can be seen in the following graph:
The rating classes are not balanced. There are far more reviews with “3” stars than with “1” or “5” stars. I already expected that result, since there will always be many more okay movies than awful or excellent ones. The same trend shows up when we look at the distribution of ratings on each website:
*(The ratings from Cineclick can have fractional values, such as 2.5 stars. In those cases I have rounded the rating up.)*
We can see that most websites follow this trend in how they distribute their ratings, except for Cinema Em Cena, which has more “4”-star reviews than “3”-star ones.
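As a rough illustration of the rounding and the per-website counts described above, here is a minimal sketch. It assumes each review is available as a `(website, rating)` pair. The function names and the sample values are made up for the example and are not the crawled data or the project's actual code.

```python
import math
from collections import Counter, defaultdict


def to_star_class(rating: float) -> int:
    """Round a possibly fractional rating up to a whole star (2.5 -> 3)."""
    return math.ceil(rating)


def rating_distribution(reviews):
    """Count how many reviews fall in each star class, per website."""
    per_site = defaultdict(Counter)
    for site, rating in reviews:
        per_site[site][to_star_class(rating)] += 1
    return per_site


# Hypothetical sample, only to show the shape of the input.
sample = [
    ("Cineclick", 2.5),
    ("Cineclick", 3.0),
    ("Cineclick", 4.5),
    ("Cinema Em Cena", 4.0),
    ("Cinema Em Cena", 4.0),
    ("Cinema Em Cena", 3.0),
]

distribution = rating_distribution(sample)
for site, counts in distribution.items():
    total = sum(counts.values())
    for stars in sorted(counts):
        share = counts[stars] / total
        print(f"{site}: {stars} stars -> {counts[stars]} reviews ({share:.0%})")
```

Printing the share of each class next to the raw count makes the imbalance easy to spot without a plot.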
Finally, we must also look at the average length of the reviews when deciding which model to use:
We can see that the average number of words in each review is quite high, especially for the reviews from Cinema Em Cena.
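The average length per website could be computed along these lines. This is only a sketch: it assumes reviews come as `(website, text)` pairs and that a "word" is any whitespace-separated token, which may not match how the numbers in this post were actually counted. The sample texts are invented.

```python
from collections import defaultdict
from statistics import mean


def average_words_per_site(reviews):
    """Return the mean number of whitespace-separated tokens per website."""
    lengths = defaultdict(list)
    for site, text in reviews:
        lengths[site].append(len(text.split()))
    return {site: mean(counts) for site, counts in lengths.items()}


# Hypothetical sample texts, far shorter than real reviews.
sample = [
    ("Cineclick", "Um filme divertido e leve"),
    ("Cineclick", "Roteiro fraco"),
    ("Cinema Em Cena", "Uma obra complexa que merece ser revista com calma"),
]

averages = average_words_per_site(sample)
for site, avg in averages.items():
    print(f"{site}: {avg:.1f} words on average")
```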
So we have a small, unbalanced dataset whose reviews contain a relatively high number of words. Therefore, I don’t believe that a simple model and a simple training approach will be enough to solve this problem.
However, before throwing Deep Learning into this problem, in the second post of this project, I will create a simple baseline with a bag of words model and try some approaches on how to handle the unbalanced dataset. Stay tuned :)
(If you want to better understand how I am downloading the files, the format of each review file, and how all of these graphs were generated, please take a look at the GitHub page for this project. PRs are always welcome too.)