Words According to r/Vancouver [Part I], word2vec

I have always enjoyed some of the things on r/Vancouver, the subreddit. During the Xmas holiday, I thought it'd be fun to build a few NLP projects with the comments I could gather.

TL;DR: if you simply want to see some funny results, go to:

http://jiachen.io/ml/rvancouver/word2vec

Although NLP is a subbranch of machine learning, it is a fairly complex area with a lot of research going on. This series of blog posts is about seeing NLP in practice (much as I did when building Talentful.ai) as much as it is about entertainment. I will spin up a server for the general public to access, just to show how the different NLP models work.

Through this and the next few blog posts, here is what you can expect:

NLP Models trained off r/Vancouver comments

  1. Word2Vec trained with r/Vancouver comments. You will see how a high-dimensional text corpus is compressed into short word vectors, and you can interact with some of the funny results. Although you may see results that sound absolutely off the political-correctness chart, that perfectly reflects a shortcoming of NLP today: the quality of a model heavily depends on the quality of its training data.
  2. A classifier that predicts whether a random comment will receive negative votes by leveraging the power of RNNs (you will see the difference between a simple RNN and an LSTM).
  3. A chatbot capable of producing text (natural-language generation) by studying the corpus of r/Vancouver comments, understanding its meaning and writing new textual sequences.

For privacy and legal reasons, I will not be able to provide any data, but the code will be provided so that you can acquire the data on your own.

Some of the packages used include, but are not limited to: Keras (with a TensorFlow backend), pandas, Gensim, NumPy and NLTK (sadly, not spaCy). I also have a pretty powerful workstation (64 GB RAM + a 1080 Ti GPU).

Part I: Training Word2Vec Word Embedding Model

In this blog post, I will focus on building a Word2Vec model from the r/Vancouver comment data. In the end, you will be able to ask questions such as: "What words are closely related to the word 'vancouver' on this subreddit?", "Conceptually, what is the subredditors' opinion of 'drug' + 'hastings'?" or "What is to 'money' as 'restaurant' is to 'foodie'?"

For any machine learning problem, the most vital part is acquiring and cleaning the data. That is even more true in this particular case, as we are dealing with corpora quite different from Google News or the Brown Corpus. I will explain why in a minute.

Step 1: Acquiring data

Reddit does provide APIs for interacting with the site. However, they are quite confusing and much less robust than many of the commercial APIs I have worked with (such as GitHub's and Twitter's). Further, Reddit places a threshold on how much you can acquire, rendering its API almost useless in this case: I wanted a large body of text to get accurate results. How large? I was not sure. I first tried 1 million comments and the results were not good enough, so I asked for 10MM; PushShift only returned approximately 2.8MM comments. Oh well, let's get into the code.

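Roughly, the acquisition code boils down to the sketch below. The helper name fetchComments, the request parameters and the pickle path are illustrative rather than my exact script; only loadCommentData() is used again later.

import time
import requests
import pandas as pd

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/comment/"

def fetchComments(subreddit="vancouver", target=10_000_000):
    """Page backwards through Pushshift until the target count is hit
    or the API runs out of comments, keeping only the body and score."""
    rows, before = [], None
    while len(rows) < target:
        params = {"subreddit": subreddit, "size": 500}
        if before is not None:
            params["before"] = before
        batch = requests.get(PUSHSHIFT_URL, params=params).json()["data"]
        if not batch:
            break
        rows.extend({"text": c["body"], "score": c["score"]} for c in batch)
        before = batch[-1]["created_utc"]  # paginate by timestamp
        time.sleep(1)                      # be polite to the API
    return pd.DataFrame(rows, index=[f"comment {i}" for i in range(len(rows))])

def loadCommentData(path="comments.pkl"):
    """Reload a previously saved pull, so a kernel crash costs nothing."""
    return pd.read_pickle(path)

cache = fetchComments()
cache.to_pickle("comments.pkl")
print("data loaded and row count:", len(cache))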
Here are the first few comments, printed by calling cache.head():

data loaded and row count: 2765779
                                                        text  score
comment 0  If you were only renting a room you have no fa...      1
comment 1  Without a lease and paying everything under th...      1
comment 2                                            Thanks!      1
comment 3  I agree, I love the classics, they are the bes...      1
comment 4                          Four should be enough...       1

I didn't capture the exact timestamps of the comments, nor the redditors' usernames, as I only care about the text and the score. (It would be great if I had more data.) One thing to note is that I also didn't acquire the bodies of the posts the comments were made on; I focus only on the comments. Why? Because Reddit posts contain a lot of news articles that do not reflect the sentiment and wording of the redditors. Our objective is to understand the lexicon of the users, not the posts. This would change if we planned to predict, say, what type of post gets upvoted. Your training data has to be consistent with the problem you are trying to solve; if not, you are simply adding noise.

You will also notice that I saved the data to a file, because I am always paranoid about losing data before my work is finished. You can call loadCommentData() to reload the data into memory in case your kernel crashes at some point.

Step 2: Examine and clean the data

At this point, we are almost ready, and we now face the most important part of the task: preparing the training data. Every machine learning project requires different data preparation because the data and the objectives differ. For example, at Talentful.ai I had to deal with the resumes of software developers, which is nothing like dealing with blogs, so my processing there was very different. Second, the objective here is to find good word embeddings rather than to produce a classification; therefore, I extract different features than I did in other projects.

Make no mistake, we are gathering a social media lexicon, which contains all kinds of informal writing, spelling errors, emojis, handles and non-ASCII characters. What should we do about it? Well, let's take a look at what is in the data by building a quick scikit-learn CountVectorizer fed by NLTK's casual tokenizer.

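The vectorizer cell is roughly the following sketch (the exact arguments are illustrative; cache is the DataFrame from step 1):

from nltk.corpus import stopwords
from nltk.tokenize import casual_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Count unigrams and bigrams, tokenized the "social media" way,
# with common English stopwords dropped.
vectorizer = CountVectorizer(tokenizer=casual_tokenize,
                             ngram_range=(1, 2),
                             stop_words=stopwords.words("english"))
counts = vectorizer.fit_transform(cache["text"].astype(str))

# Total frequency per term, then the 100 most popular ones.
totals = counts.sum(axis=0).A1
terms = vectorizer.get_feature_names_out()
print(sorted(zip(terms, totals), key=lambda p: -p[1])[:100])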
A few things to note in the code. You will see that I used casual_tokenize instead of the Treebank tokenizer; that is because we are dealing with social media text and the corpora contain informal writing. Second, you may note that I used ngram_range=(1,2) instead of the default unigrams only. This significantly increases the fitting time of the model, but it lets us see some patterns and gives me an idea of whether I should apply phrase processing later to make sure we are not losing important concepts/phrases. Third, you may have noted that I filtered out common English stopwords: since we are using a count vectorizer, leaving the stopwords in would easily clog the results.

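To see why the casual tokenizer matters for this kind of text, here is a tiny comparison on a made-up sentence:

from nltk.tokenize import TreebankWordTokenizer, casual_tokenize

text = "that's sooo awkward 😅 lol check https://example.com :-)"
print(TreebankWordTokenizer().tokenize(text))  # splits the contraction, URL and emoticon apart
print(casual_tokenize(text))                   # keeps them as single tokens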
The 100 most popular unigrams/bigrams:

Surprisingly, they are nearly all emojis. This suggests the comments were mostly made on mobile devices rather than computers. It also gives me some clues about what to filter out later.

To turn a text corpus into tokens, here are the common steps involved:

  1. tokenization
  2. filter out unwanted words (such as stop words, punctuation etc.)
  3. lemmatize or stem words

Let’s get right into it:

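The cell itself is essentially the sketch below (the function name and exact filtering details are illustrative; my real version leaves a few empty and stray-punctuation tokens behind, as you can see in the output that follows):

import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import casual_tokenize

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def commentToTokens(comment):
    """Lowercase, strip handles, drop stopwords and punctuation,
    and lemmatize nouns (the lemmatizer's default part of speech)."""
    tokens = casual_tokenize(comment, preserve_case=False, strip_handles=True)
    tokens = [t.strip(string.punctuation) for t in tokens]
    return [lemmatizer.lemmatize(t) for t in tokens if t and t not in stop_words]

# Skip comments that mods or users removed; they add nothing to the lexicon.
sentences = [commentToTokens(c) for c in cache["text"].astype(str)
             if c.strip() not in ("[removed]", "[deleted]")]

vocabulary = {t for sent in sentences for t in sent}
print("vocabulary size:", len(vocabulary))
print(sentences[:10])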
vocabulary size: 479338 
[['renting', 'room', 'fancy', 'tenant', 'right', ''], ['without', 'lease', 'paying', 'everything', 'table', '', 'sure', 'much', 'way', 'right', '', 'since', '’', 'paperwork', 'stating', 'rent', 'house', '', '’', 'likely', 'within', 'right', 'rent', '3rd', 'room', '', 'landlord', 'allowed', 'key', 'place', '', 'give', '48', 'hour', 'notice', '', 'think', '', 'entering', 'premesis', '', 'saying', '’', 'suck', 'handled', 'better', '', 'sure', 'much', 'legal', 'ground', ''], ['thanks', ''], ['agree', '', 'love', 'classic', '', 'best', '', 'many', 'pcars', 'due', 'whole', 'porsches', '70', 'year', 'anniversary', 'thing', ''], ['four', 'enough', ''], ['hoping', '', 'quiet', 'holiday', '', 'thinking', 'led', 'bulb', 'must', 'reduced', 'fire', 'quite', 'bit', 'old', 'incandescent', 'bulb', '', 'maybe', 'still', 'people', 'getting', 'trouble', 'overdoing', 'holiday', 'feast', 'though', ''], ['enough', 'worker', 'culture', '’', 'celebrate', 'christmas', 'business', 'could', 'stay', 'open', 'really', 'wanted', '', 'low', 'demand', 'probably', '’', 'make', 'worthwhile', 'many', 'business', '', 'run', 'today', 'get', 'coffee', 'creamer', 'avert', 'true', 'christmas', 'tragedy', '', '711', 'good', 'enough', '', 'would', 'handy', 'hit', 'real', 'grocery', 'store', 'family', 'asking', 'getting', 'baked', 'good', 'breakfast', 'tomorrow', ''], ['one', 'pair'], ['funny', 'everyone', 'thread', 'hate', 'thief', 'nobody', 'prepared', 'support', 'change', 'necessary', 'prevent', 'type', 'theft', '', 'junky', 'protected', 'class', 'vancouver', '', 'adding', 'amenity', 'solve', 'problem', '', 'attracts', 'junky', 'canada', ''], ['one', 'kingsway', 'victoria', '', 'yikes', '']]

There is a lot to unpack here.

First, I tokenized the comments with NLTK's casual tokenizer. I normalized the text by lowercasing each comment and stripping handles (although handles are uncommon in Reddit text; you are more likely to see "/u/XXX" than @ used to reference users).

Next, I removed the punctuation tokens, which is a relatively straightforward process. Some punctuation is still left; however, since it is relatively rare in the grand scheme of things, I didn't bother removing all of it.

Some practitioners may use the following two functions to remove punctuation and non-ASCII characters.

import re
import unicodedata

def remove_punctuation(words):
    """Remove punctuation from a list of tokenized words."""
    new_words = []
    for word in words:
        # Drop every character that is not alphanumeric, underscore or whitespace
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def removeNonASCII(sent):
    """Force every token down to its closest ASCII representation."""
    words = []
    for word in sent:
        word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('ascii')
        words.append(word)
    return words if len(words) > 0 else None

But they are problematic in our case. As you have seen, redditors love emojis. Try "Awww, that's awkward 😅, isn't it?" and you will see the emoji completely removed, which is not what we want here. We want to preserve the emojis: they are common and they carry important sentiment.

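To see this concretely, run that sentence through remove_punctuation() after tokenizing it:

from nltk.tokenize import casual_tokenize

print(remove_punctuation(casual_tokenize("Awww, that's awkward 😅, isn't it?")))
# ['Awww', 'thats', 'awkward', 'isnt', 'it'] -- the emoji, and its sentiment, is gone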
Then, I got rid of all the "[removed]" comments. On Reddit, subreddit mods or the users themselves may delete certain unpopular comments, leaving only a placeholder. These add no value to studying the lexicon of the comment corpora, and they are quite common in the set, so removing them is critical.

Next, I removed the stopwords by checking against NLTK's stopword list. You may add more words to the list as you wish, but for this task I had no need to. Again, how you prune your text is highly specific to your objectives and context.

Now it comes to the interesting part. English is a fairly complex language, as it employs the concepts of tense and plurality. To reduce the vocabulary size and the noise, I decided to lemmatize the words. I could also have stemmed them, but since I want to give others the chance to interact with words they can actually think of, stemming isn't necessary. You may skip lemmatization and stemming entirely if your vocabulary/training dataset is fairly small. You will also notice that I only lemmatize nouns (the lemmatizer's default), not verbs. The reason is that I care about concepts more than actions: given enough training data, Word2Vec should return more nouns than verbs for my queries, unless certain verbs are always tied to specific nouns. In a production environment, I would do a lot more to further denoise the lexicon.

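As a quick illustration of what noun-only lemmatization does (and deliberately does not do):

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("houses"))            # 'house'   -- plural nouns are collapsed
print(wnl.lemmatize("renting"))           # 'renting' -- verbs are left alone by default
print(wnl.lemmatize("renting", pos="v"))  # 'rent'    -- only if we lemmatize verbs explicitly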
Finally, you will notice a function called addBigrams(). In essence, it creates and applies a Gensim phrase model. If your corpora contain a lot of phrases, you may need this. Just for the fun of it, I ran the function, and it turned out:

Bigram Vocabulary Size: 10522914

Without bigrams, the vocabulary size is around 479,338, which is a big contrast. Bigram processing can make the tokens much more sparse in this case; while it did add a marginal benefit to the model, it cost a lot of computing time.

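For reference, addBigrams() is presumably a thin wrapper around Gensim's Phrases/Phraser, roughly like the sketch below (sentences is the list of token lists from step 2; min_count and threshold are placeholders, and the 10.5MM figure above most likely counts Phrases' internal candidate vocabulary rather than the final token set):

from gensim.models.phrases import Phrases, Phraser

def addBigrams(sentences, min_count=20, threshold=10):
    """Learn frequent collocations (e.g. 'east_hastings') and merge
    them into single tokens in every sentence."""
    phrases = Phrases(sentences, min_count=min_count, threshold=threshold)
    print("Bigram Vocabulary Size:", len(phrases.vocab))  # unigrams + candidate bigrams
    bigram = Phraser(phrases)
    return [bigram[sent] for sent in sentences]

sentences_bigrams = addBigrams(sentences)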
Now, you may be wondering: doesn't an adult know, on average, between 25,000 and 35,000 words? Why do roughly 479K terms show up? Because this is social media: it contains numbers, spelling errors, emojis, informal writing, foreign words, and combinations of numbers and English words of various lengths. I could clean these up further to reduce the lexicon size, but again, I am not building the kind of precise model I typically would in production.

So, all in all, the blathering above comes down to one point: your feature extraction and data preparation are all about your data and your objectives. What I did here could be miserably incorrect in your case, because it is specific to the dataset I had.

Step 3: Create, train and save a Word2Vec model with Gensim

Gensim makes it very easy to build a Word2Vec model. In less than 2 minutes, I can create one.

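The training cell and the queryWords() helper used below boil down to something like this sketch (window, min_count and workers are guesses; in Gensim 4.x the size parameter is named vector_size):

from gensim.models import Word2Vec

# 500-dimensional embeddings trained on the tokenized comments (sentences).
model = Word2Vec(sentences, size=500, window=5, min_count=5, workers=8)
model.save("rvancouver.w2v")

def queryWords(positive=None, negative=None, topn=20):
    """Thin wrapper around Gensim's built-in vector arithmetic."""
    return model.wv.most_similar(positive=positive or [],
                                 negative=negative or [], topn=topn)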
Notice that I set the number of features (size) to 500. Typically, it is between 100 and 300; anything more than that brings diminishing returns. However, since I have a powerful machine, hell with it. For example, here is what comes back for words related to "drug":

[('heroin', 0.8245717287063599), ('opioids', 0.8080341815948486), ('opiate', 0.8065546751022339), ('narcotic', 0.7980027794837952), ('cocaine', 0.7960213422775269), ('substance', 0.7801455855369568), ('opioid', 0.7792735695838928), ('heroine', 0.7756026387214661), ('fentanyl', 0.7754548192024231), ('overdosing', 0.7655848264694214), ('oxy', 0.7582300305366516), ('prostitution', 0.7582054734230042), ('mdma', 0.7489495873451233), ('fent', 0.7473188638687134), ('methamphetamine', 0.7472093105316162), ('methadone', 0.7469868063926697), ('meth', 0.7462027072906494), ('addiction', 0.7460451126098633), ('opiods', 0.7457476854324341), ('alcoholism', 0.7450610399246216)]

This is where the funny things begin. The r/Vancouver redditors' views on drugs are fairly spot on.

In the classic Word2Vec example, king - man + woman = queen. What does that give on r/Vancouver? The subreddit did not disappoint:

queryWords(positive=["king", "woman"], negative=["man"], topn=20)
[('costanza', 0.7922869324684143), ('soros', 0.7831655740737915), ('takei', 0.7725933790206909), ('wainborn', 0.7656800746917725), ('carlin', 0.7636248469352722), ('clooney', 0.7580316662788391), ('gretes', 0.7495653033256531), ('rammell', 0.7486284375190735), ('heyman', 0.7447643280029297), ('jetson', 0.7397222518920898), ('zimmerman', 0.7369969487190247), ('puil', 0.7369533181190491), ('prince', 0.7357513308525085), ('orwell', 0.7245429754257202), ('sheeran', 0.7203205823898315), ('33rd', 0.7150120139122009), ('sr', 0.7117825150489807), ('57th', 0.7110642194747925), ('blvd', 0.7072034478187561), ('41st', 0.702060341835022)]

There is a lot of talk about Chinese investors buying up houses in Vancouver. Let's see what the subreddit's impressions are:

queryWords(positive=["chinese", "house"], negative=None, topn=20)
[('mansion', 0.49464866518974304), ('home', 0.46124938130378723), ('multimillionaires', 0.4556581676006317), ('foreigner', 0.45450925827026367), ('condo', 0.4456098973751068), ('wealthy', 0.4434904456138611), ('westside', 0.4430326223373413), ('millionaire', 0.4400303363800049), ('china', 0.43974569439888), ('property', 0.4393745958805084), ('bungalow', 0.43804118037223816), ('hk', 0.43746697902679443), ('overseas', 0.4370615780353546), ('teardowns', 0.4307893216609955), ('billionaire', 0.4271533787250519), ('homemaker', 0.4266085624694824), ('asian', 0.42537936568260193), ('sfhs', 0.4233153462409973), ('iranian', 0.42309990525245667), ('shaughnessy', 0.42088112235069275)]

It looks like, according to r/Vancouver, a house owned by a Chinese buyer must be a "mansion".

What about Chinese people without money?

queryWords(positive=["chinese"], negative=["money"], topn=20)
[('hongkongers', 1.7273292541503906), ('slavic', 1.7047483921051025), ('taiwanese', 1.6945977210998535), ('1907', 1.6845422983169556), ('chine', 1.6384543180465698), ('distinguishes', 1.6222048997879028), ('japanese', 1.6158061027526855), ('dialect', 1.6088076829910278), ('antiasian', 1.6048954725265503), ('antichinese', 1.6030734777450562), ('tibetan', 1.602021336555481), ('malay', 1.601813554763794), ('asian', 1.5999736785888672), ('asain', 1.5998543500900269), ('sichuan', 1.5976003408432007), ('arabic', 1.5966078042984009), ('uyghur', 1.5946550369262695), ('swahili', 1.5945265293121338), ('komagata', 1.5940673351287842), ('fijian', 1.5935890674591064)]

It looks like, in their impression, Chinese people without money must be from Hong Kong or Taiwan. 😂 😹

What about dogs in Vancouver?

queryWords(positive=["dog", "vancouver"], negative=None, topn=20)
[('offleash', 0.3911096155643463), ('breed', 0.3889022469520569), ('pet', 0.38839638233184814), ('city', 0.38739466667175293), ('vancouverites', 0.3827388286590576), ('geographically', 0.37591075897216797), ('mutt', 0.3759056329727173), ('van', 0.3756505250930786), ('onleash', 0.3747195601463318), ('pitbulls', 0.37297990918159485), ('area', 0.37186378240585327), ('nondog', 0.3712121248245239), ('chihuahua', 0.36971208453178406), ('petfriendly', 0.3694041669368744), ('place', 0.3684462606906891), ('corgi', 0.3671013116836548), ('gva', 0.36580806970596313), ('burbs', 0.3653530776500702), ('unfriendly', 0.3638651967048645), ('suburb', 0.36223104596138)]

Wanna play with it yourself? Here you go:

http://jiachen.io/ml/rvancouver/word2vec