This post describes a language model I created for generating funk lyrics, a popular music genre in Brazil. I have already written a Medium post in Portuguese describing the process of collecting the data and generating the songs, so my goal here is to explain how I built the model and discuss some of the decisions I made during the development of this project.
I have not used any Portuguese pre-trained word embeddings for this problem. The main rationale behind this decision is that most Portuguese word embeddings I know of are trained on the Wikipedia corpus. Although they provide a rich set of words and contexts for those words, I don't believe they would make my model significantly better, because my main source of data, the funk lyrics, is vastly different from Wikipedia articles. Therefore, I have not used pre-trained embeddings, but I have not ruled this option out for future experiments.
I removed all punctuation and discarded words that do not appear at least 5 times in the dataset. Furthermore, I split the songs into chunks of 32 words and used these chunks as the training data for my model.
Additionally, I added special tokens at the beginning and end of each song, and replaced any word that does not appear in my vocabulary with a token representing an unknown word.
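The preprocessing steps above can be sketched as follows. This is a minimal reconstruction, not the actual project code; the token names, thresholds, and function names are my own illustrative choices:

```python
import re
from collections import Counter

MIN_COUNT = 5    # words rarer than this become the unknown token
CHUNK_SIZE = 32  # training sequences are 32 words long

def preprocess(songs):
    """Tokenize songs, build a vocabulary, and cut them into chunks."""
    # Strip punctuation and lowercase each song.
    tokenized = [re.sub(r"[^\w\s]", "", s).lower().split() for s in songs]

    # Keep only words that appear at least MIN_COUNT times.
    counts = Counter(w for song in tokenized for w in song)
    vocab = {w for w, c in counts.items() if c >= MIN_COUNT}

    chunks = []
    for song in tokenized:
        # Mark song boundaries and replace out-of-vocabulary words.
        words = ["<start>"] + [w if w in vocab else "<unk>" for w in song] + ["<end>"]
        # Split into fixed-size chunks used as training examples.
        chunks.extend(words[i:i + CHUNK_SIZE] for i in range(0, len(words), CHUNK_SIZE))
    return chunks, vocab
```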
I trained a Recurrent Neural Network (RNN) that uses an LSTM cell to generate its state vector. I stacked 3 of these layers together, each with 728 units.
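Schematically, stacking means the output sequence of one LSTM layer becomes the input sequence of the next. A sketch of that wiring, with a generic `cell` function standing in for the LSTM (this is my own illustration, not the training code):

```python
def stacked_rnn(cells, inputs):
    """Run `inputs` (a list of per-time-step values) through stacked RNN layers.

    `cells` is a list of functions cell(x, state) -> (output, new_state),
    one per layer; each layer consumes the previous layer's output sequence.
    """
    sequence = inputs
    for cell in cells:
        state, outputs = None, []
        for x in sequence:
            out, state = cell(x, state)
            outputs.append(out)
        sequence = outputs  # feed this layer's outputs to the next layer
    return sequence
```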
Furthermore, I used dropout to regularize the network. I applied dropout to the recurrent step of the network, using the variational recurrent dropout technique, and I also applied dropout to the outputs of the RNN layers. In addition, I applied dropout to the embedding matrix, as described in the variational dropout paper. I used 0.5 as the dropout probability for all of these.
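The key idea of variational recurrent dropout is that a single dropout mask is sampled per sequence and reused at every time step, instead of drawing a fresh mask at each step. A minimal NumPy sketch of that idea, assuming nothing beyond the technique itself (the function names are mine):

```python
import numpy as np

def variational_dropout_mask(batch_size, dim, p=0.5, rng=None):
    """Sample ONE dropout mask per sequence, to be reused at every time step.

    Ordinary dropout draws a fresh mask per step, which disrupts the
    recurrent dynamics; variational dropout fixes the mask for the
    whole sequence.
    """
    rng = rng or np.random.default_rng()
    # Inverted dropout: scale by 1/(1-p) so expected activations are unchanged.
    return rng.binomial(1, 1 - p, size=(batch_size, dim)) / (1 - p)

def apply_over_time(inputs, mask):
    """inputs: (time, batch, dim). The SAME mask multiplies every step."""
    return inputs * mask[None, :, :]
```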
I also tried using L2 regularization, but I found that it constrained the model's capacity too heavily, so I decided to remove that type of regularization.
Finally, I used the weight tying technique to avoid maintaining two distinct embedding matrices (one for the input and one for the output) in my model.
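With weight tying, the output projection reuses the (transposed) input embedding matrix instead of learning a separate vocabulary-sized matrix, which roughly halves the embedding parameters. A small sketch of the idea, again illustrative rather than the project's actual code:

```python
import numpy as np

def tied_logits(hidden, embedding):
    """Project hidden states to vocabulary logits with the tied embedding.

    hidden:    (batch, dim)   final RNN outputs
    embedding: (vocab, dim)   the SAME matrix used to embed the inputs
    """
    return hidden @ embedding.T  # (batch, vocab)
```

Note that this requires the RNN output dimension to match the embedding dimension (or an extra projection to bridge them).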
I decided not to use a validation or a test set for this task. The main reason is that, initially, when I did use validation and test sets, I monitored perplexity to understand whether my model was overfitting. However, I realized that this metric did not correlate well with the quality of the songs my model was generating: the songs were not getting better as perplexity decreased. I found that using the whole dataset to train the model produced songs of better overall quality.
However, I must add that using validation and test sets was a really good choice in the early stages of the project, since I used them to choose the number of RNN layers, the number of units per layer, and to confirm that L2 regularization was not giving me any benefit.
Finally, I must add that the way I evaluated whether a song was good or not was totally subjective. I believe this is a problem, since it made decision making a lot harder throughout this project. If I start another language modelling project, I will invest a great deal of time in this subject, to find a better way to evaluate the sentences generated by the model.
Since I wanted to deploy this model in production, I considered using TensorFlow Serving for this task. However, I realized it was not an appropriate choice, since I would need to issue a new request to the server for every word generated, which was not ideal.
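The underlying issue is that generation is a stateful, word-by-word loop: each new word depends on the hidden state produced by the previous one, so a stateless prediction server would need one round trip per word. A schematic of that loop, where the `step` function stands in for one forward pass of the real model (all names here are illustrative):

```python
import numpy as np

def sample_song(step, start_id, end_id, vocab_size, max_len=200, rng=None):
    """Generate a song one word at a time.

    `step(word_id, state) -> (logits, new_state)` stands in for a single
    forward pass of the RNN; `state` is the hidden state carried between words.
    """
    rng = rng or np.random.default_rng()
    state, word, song = None, start_id, []
    for _ in range(max_len):
        logits, state = step(word, state)
        # Softmax over the vocabulary, then sample the next word.
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        word = rng.choice(vocab_size, p=probs)
        if word == end_id:
            break
        song.append(word)
    return song
```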
Therefore, I created a Docker container that uses Flask and gunicorn to serve the application. Although the container I generated is really big (more than 1.5 GB), I don't think this will be a major problem, since I am not aiming at scalability here. Still, mounting the model checkpoint as an external volume, instead of including it in the container, would have been a better alternative than the one I took.
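A sketch of the kind of Dockerfile this setup implies; the file names, module path, and port are illustrative assumptions, not the actual project layout:

```dockerfile
FROM python:3.7-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# The model checkpoint is baked into the image here; mounting it as an
# external volume at runtime would keep the image much smaller.
COPY . .

# gunicorn serves the Flask application object (app.py exposing `app`).
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:app"]
```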
This was a really hard problem for me to solve. Not only did collecting the data take a lot of time, but my decision making process was not as good as it could have been. Although the project was successful, it made me realize one of the difficulties of using Deep Learning for Natural Language Processing: finding the right metric to optimize is a genuinely hard task.
But that difficulty has motivated me to research this topic further and better understand how other people and research groups have handled this issue.