Project Update - Generative NLP using Chrissy Teigen Tweets
Roughly a month ago, I decided to flex my machine learning skills by developing a Chrissy Teigen bot based on tweets she had already sent. I figured it would take a couple of days - a week, max?
Yes, I am kidding. I've used machine learning in the past for magnetic resonance images (see conference paper here), but if I'm being totally honest, I didn't fully understand what I was doing, as I never had formal training in machine learning. I could (and might) write a whole blog article on why, in my opinion, it's a double-edged sword that you don't really need to know much about machine learning to develop models.
In any case, I decided to start from the beginning. Last semester, I took Stanford's Machine Learning course on Coursera, taught by Dr. Andrew Ng, and have been reading ML papers as often as I can to supplement my learning. I'm in the process of taking the Neural Networks course on Coursera now, but I wanted to do a more open-ended project at the same time. For a while, I was at a loss as to what I should do, but in a burst of inspiration while on Twitter (where I spend entirely too much time), I decided to feed tweets to a recurrent neural network and see what it spat out.
Specifically, Chrissy Teigen's tweets.
Since then, I've been in the process of developing a generative recurrent neural network (which falls under the umbrella of natural language processing, or NLP) to create "fake" Chrissy Teigen tweets based on the real tweets that I give it. Ideally, I would like to be able to interface this program with a Twitter bot once the generated results look like something Chrissy Teigen would actually say, but before I can do that, the model needs to work correctly and accurately.
For those who are interested in replicating this, I'm using Keras with a TensorFlow backend, and the GitHub repository can be found here.
First Iteration: Keep It Simple
I'll explain the jargon as we go, but let's start with my first approach. As a first attempt, I used the Bag-of-Words method.
The Bag-of-Words method is exactly what it sounds like. You take a set of sentences, break the sentences down into words, and use the "bag" of words as your training data. The model looks at how frequently each word appears, and pulls words out of the bag that might represent a sentence.
This approach can be made more or less complicated with pre-processing and feature extraction steps, where you might make connections between words, parts of speech, or semantics, but for my first attempt, I went extremely basic. I pulled ~27,000 of Chrissy Teigen's most recent tweets, and trained the model on the 1000 most frequent tokens that appeared in Chrissy Teigen's tweets. Tokens are groups of characters (or individual characters, or individual sentences, depending on what you are trying to do) that appear in your data set.
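To make the frequency-counting step concrete, here's a toy sketch in plain Python (the sample tweets are made up, and the real project uses Keras tokenizer utilities rather than this exact code):

```python
from collections import Counter

# A toy stand-in for the ~27,000 scraped tweets (hypothetical sample text).
tweets = [
    "i love this so much",
    "love love love it",
    "this is so great",
]

# Split each tweet into whitespace-separated tokens and count frequencies.
token_counts = Counter(token for tweet in tweets for token in tweet.split())

# Keep only the N most frequent tokens as the model's vocabulary
# (the post uses N = 1000; N = 3 here for the toy data).
vocab = [token for token, count in token_counts.most_common(3)]
print(vocab)  # → ['love', 'this', 'so']
```

Notice that naive whitespace splitting would happily keep "@" handles and "#" hashtags as tokens, which is exactly the issue described below.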
Here's where I learned my first lesson - words and tokens are not always the same thing. Words are tokens, but so are punctuation and emojis, which also appear in tweets. Because I used the 1000 most frequent tokens, this included a lot of punctuation, emojis, and @ or # symbols. This is not surprising - all of these characters are frequently used in tweets. However, they may be less important to your model than things like prepositions and conjunctions, which can occur at similar frequencies. In other words, I needed to pre-process my data to account for punctuation and emojis.
I won't show the results of that training session here, but most of the generated tweets were odd combinations of periods, @ symbols, and hashtags. To mitigate this, I added a pre-processing step that removes all of the punctuation from the tweets before breaking them down into tokens, and tried training again. This time it went better. The generated phrases looked... more accurate?
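The punctuation-stripping step can be as simple as a translation table over Python's built-in punctuation set. This is a sketch, not the project's exact pre-processing, and note that emojis would need separate handling, since they aren't ASCII punctuation:

```python
import string

def strip_punctuation(tweet: str) -> str:
    """Remove ASCII punctuation before tokenizing (emojis are not
    in string.punctuation and would need a separate filter)."""
    return tweet.translate(str.maketrans("", "", string.punctuation))

cleaned = strip_punctuation("@chrissyteigen that's great!! #blessed")
print(cleaned)  # → "chrissyteigen thats great blessed"
```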
However, the loss and accuracy graphs made some issues apparent:
For the sake of developing a model, the second graph (Loss) shows some concerning trends. The training loss is decreasing as expected, but the validation loss is actually going up. In other words, the model is too specific - it works really well on tweets that it has already seen, but does not work on tweets it hasn't. Given that the model seems to be fitting the training data relatively well but is absolutely failing on the test data, I'm pretty sure that I am overfitting my training data.
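Here's one hypothetical way to express that divergence check in code (Keras also ships an EarlyStopping callback that halts training when validation loss stops improving, which is the more standard fix; the per-epoch losses below are made up to match the shape of the curves described above):

```python
def diverging(train_losses, val_losses, patience=3):
    """Flag the train/validation divergence described above: training loss
    keeps falling while validation loss has risen for `patience` epochs."""
    recent = val_losses[-(patience + 1):]
    val_rising = all(b > a for a, b in zip(recent, recent[1:]))
    train_falling = train_losses[-1] < train_losses[0]
    return val_rising and train_falling

# Hypothetical per-epoch losses: training improves, validation worsens.
train = [2.1, 1.6, 1.2, 0.9, 0.7, 0.5]
val   = [2.2, 1.9, 1.8, 1.9, 2.1, 2.4]
print(diverging(train, val))  # → True
```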
Another aspect of the data that I looked at was word embeddings. Words that are closer together on the graph are more similar or are used in similar situations by the writer, so it's not surprising that "love", "happy", and "great" are close together. Nor is it surprising that "president" is very far away from that cluster. It is somewhat surprising that "thing" is also in that cluster, but "fascinated" is much further away. One of my planned next steps is to add more words to this plot, in the hopes of drawing more connections between terms in her Twitter vocabulary.
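"Closeness" on an embedding plot boils down to distance (or angle) between vectors. Here's a minimal sketch with made-up 3-dimensional embeddings - real learned embeddings have far more dimensions, and these particular numbers are purely illustrative:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors; values near 1.0
    mean the words are used in more similar contexts."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings mimicking the clusters described above.
emb = {
    "love":      [0.9, 0.8, 0.1],
    "great":     [0.8, 0.9, 0.2],
    "president": [-0.7, 0.1, 0.9],
}

print(cosine_similarity(emb["love"], emb["great"]))      # close to 1
print(cosine_similarity(emb["love"], emb["president"]))  # negative
```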
Second Iteration: Adding More Features
The second approach involved some pre-processing. Instead of only including tokens in the model, I labeled them with their respective parts of speech, grouping them into word classes. More information on the grouping method can be found here.
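To illustrate what the POS labeling adds to the tokens, here's a toy dictionary-based tagger. The lookup table is hypothetical and only for illustration - a real project would use a trained tagger such as NLTK's pos_tag:

```python
# Toy lookup-table tagger; real taggers are trained on corpora
# and handle ambiguity ("love" can be a noun or a verb).
POS = {"i": "PRON", "love": "VERB", "this": "DET", "so": "ADV", "much": "ADV"}

def tag(tokens):
    """Pair each token with its word class, defaulting to NOUN."""
    return [(t, POS.get(t, "NOUN")) for t in tokens]

print(tag("i love this so much".split()))
# → [('i', 'PRON'), ('love', 'VERB'), ('this', 'DET'), ('so', 'ADV'), ('much', 'ADV')]
```

The idea is that the model now sees (token, word class) pairs instead of bare tokens, giving it an extra feature per token.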
This change did not seem to result in higher accuracy or lower loss compared to the previous model, so we're still overfitting the training data with this new model.
Interestingly, the generated tweets also got much worse.
And now we've caught up to my last repository commit. What have we learned so far? Well, we can extract some interesting subject-specific associations between word usage using word embeddings. Chrissy Teigen uses the terms "love," "great," and "thing" in similar situations, but uses the terms "president" and "fascinated" in very different ones, which implies a negative connotation for the latter terms. This could also be extrapolated to other subjects or other platforms - Do the associations change if you give the network transcripts of Teigen's speeches or John Legend's tweets? In terms of the model, we've learned that it works very well on the training data, but not too well on the validation data. This could be caused by using an inaccurate loss function, using the wrong amount of training data, or not having enough variables for the model to look at.
So, what are my next steps?
Well, one thing that I'd like to do is look into the math behind my loss and accuracy functions. This is an unsupervised model, so it is not making predictions of text and then comparing the prediction to the real result. I am currently using categorical cross-entropy as the loss function and Adamax as the optimizer, but there may be other losses or metrics that make more sense for my data and model.
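For reference, the math behind categorical cross-entropy fits in a few lines of plain Python. With one-hot targets, it reduces to the negative log of the probability the model assigned to the true next token (the probability vectors below are made up for illustration):

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Loss for one prediction: -sum over classes of true * log(predicted).
    With a one-hot target this is just -log(prob of the true class)."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# One-hot target: the true next token is class 1 of 3 (hypothetical).
y_true = [0.0, 1.0, 0.0]

confident = [0.05, 0.9, 0.05]   # model puts 90% mass on the right token
uncertain = [0.4, 0.3, 0.3]     # model barely prefers a wrong token

print(categorical_crossentropy(y_true, confident))  # low loss
print(categorical_crossentropy(y_true, uncertain))  # much higher loss
```

Seeing the formula written out makes it easier to judge whether a different loss would reward the model more sensibly for this data.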
Another thing I'd like to do is look at semantics. POS tagging has definitely helped the model figure out what parts make up a sentence, but it hasn't figured out how to choose the right words for each part based on context. Extracting semantics-based features should help it create sentences that make a bit more sense.
Thanks for reading!