In this blog, we will talk about how to use two different machine learning techniques (LDA and Word2Vec) to analyze and query the topics in sales emails (or any emails). I will also share the code (under 50 lines) so you can start querying your emails immediately. You don't need any machine learning background to proceed.

For revenue operations professionals, it's crucial to understand sales reps' day-to-day activities and continually align strategy and execution. Tableau and Birst are excellent sources of insight. But often, scattered data such as call and email logs are ignored because they are harder to analyze and model. Yet we all know these are valuable data. They contain the day-to-day interactions between our frontline sales team and prospects. The sentiment, the topics, and the frequency can all be good indicators of sales performance.


In this blog, I present no more than 50 lines of machine learning code so you can immediately analyze what is inside these treasure troves.
We will use two types of machine learning techniques to achieve the goal:

  1. Latent Dirichlet allocation (LDA)
  2. Word2Vec

What are LDA and Word2Vec in layman's terms?

LDA intakes a group of documents (texts) and studies the patterns of the words used in these documents; it then groups words into a set number of topics. For example, you can ask LDA to learn 5 topics from 100 documents. Then you can print out the top 100 words for each topic.

Word2Vec is a group of word embedding algorithms that intake groups of documents, study the patterns, and turn each word into a numeric representation. Because each word is embedded into numeric values, and each numeric value represents a piece of context, the model can subsequently be used to understand related words/concepts. For example, Queen = King - Man + Woman.

What you need to begin

  • Python 3 installed
  • The following Python packages installed:
    • Pandas
    • Spacy
    • Gensim

Load the packages and data

The code before line 10 is pretty self-explanatory: you are merely loading Pandas, a data analysis library for handling large datasets; Spacy, a natural language processing (a type of machine learning dealing with text) library; and Gensim, a topic modeling library.


In my example, my file is called "p1.csv," and all the emails are under a column called "text_body." You can change the filename and column name accordingly.
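If you want to follow along, a minimal sketch of this setup might look like the following. I'm assuming Spacy's small English model (en_core_web_sm) here; the filename and column name match the example above.

```python
# A minimal sketch of the setup, assuming the filename and column name from
# the example above and Spacy's small English model.
import pandas as pd
import spacy
import gensim  # used in the later steps

nlp = spacy.load("en_core_web_sm")  # install first: python -m spacy download en_core_web_sm

df = pd.read_csv("p1.csv")
emails = df["text_body"].astype(str).tolist()
```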

Using LDA to model the topics

Let's spend a moment talking about what's going on here. The first function you will see is called tokenizeDoc. It tokenizes a document (in our case, an email). What is tokenization? In this context, it means chopping an email into smaller pieces that machine learning models can digest. For example, after tokenization, "Softmax Data builds custom machine learning solutions for revops" becomes ["softmax", "data", "builds", "custom".....]. Another thing to notice is that our function also converts uppercase and sentence case into lowercase, and removes "stop words" such as "is" and "for." This is an important step because we don't want letter case to dilute the patterns machine learning algorithms are looking for. Meanwhile, those "stop words" often appear at high frequency but add very little value to the final model.
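A minimal version of tokenizeDoc might look like this, using the Spacy pipeline loaded above and its built-in stop-word and punctuation flags; the exact implementation in the original code may differ in the details.

```python
# A sketch of tokenizeDoc, assuming the Spacy pipeline (nlp) loaded above.
def tokenizeDoc(doc):
    """Lowercase a document and drop stop words, punctuation, and whitespace."""
    return [
        token.text.lower()
        for token in nlp(doc)
        if not token.is_stop and not token.is_punct and not token.is_space
    ]

# Tokenize every email once so both models below can reuse the result.
tokenizedEmails = [tokenizeDoc(email) for email in emails]
```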

Next, buildDictAndModelLDA is relatively straightforward: it builds an LDA model. First, it creates a dictionary (the total vocabulary of the texts the model will see). Then, it converts each email into a Bag-of-Words (BOW) representation. BOW is a way to encode text into numbers that computers can understand. If you are interested in what it is, check out here. Finally, Gensim uses the BOW-encoded corpus and vocabulary to train an LDA model. You will notice two parameters the function intakes: numberOfTopics and numberOfPasses. The first one, as the name suggests, specifies how many topics you want the model to learn. By default, it's five topics: the model will try to sort all words into 5 different topics, and you will later be able to query the words of each topic. The second parameter is the number of passes you want the model to run. Each pass is one iteration in which the model uses the data to train/refine itself. The more passes it runs, the more accurate it becomes, but bear in mind there is a diminishing return at some point. We use 100 passes by default; you can increase it if you have sufficient computing power.
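Putting that together, a sketch of buildDictAndModelLDA could look like this. The defaults mirror the text above; the function body itself is illustrative.

```python
from gensim import corpora, models

# A sketch of buildDictAndModelLDA: dictionary -> BOW corpus -> LDA model.
def buildDictAndModelLDA(tokenizedDocs, numberOfTopics=5, numberOfPasses=100):
    dictionary = corpora.Dictionary(tokenizedDocs)               # total vocabulary
    corpus = [dictionary.doc2bow(doc) for doc in tokenizedDocs]  # BOW encoding
    ldaModel = models.LdaModel(
        corpus,
        num_topics=numberOfTopics,
        id2word=dictionary,
        passes=numberOfPasses,
    )
    return dictionary, corpus, ldaModel

dictionary, corpus, ldaModel = buildDictAndModelLDA(tokenizedEmails)
```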

Once the model is built, you can call getTopicsLDA to query the words related to each topic. In addition to the two parameters I mentioned above, there is one more parameter called numberOfWordsPerTopic, which specifies how many words you want to print out for each topic.
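A corresponding sketch of getTopicsLDA, built on Gensim's show_topic:

```python
# A sketch of getTopicsLDA: print the top words for each learned topic.
def getTopicsLDA(ldaModel, numberOfTopics=5, numberOfWordsPerTopic=10):
    for topicId in range(numberOfTopics):
        topWords = ldaModel.show_topic(topicId, topn=numberOfWordsPerTopic)
        print(f"Topic {topicId}:", [word for word, probability in topWords])

getTopicsLDA(ldaModel)
```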

Ok, ok, so far, we print out X number of topics and Y number of words for each topic, and you can get a basic grasp of what's going on. What if you want to query further? What if you want to see all the words related to one specific word? What if you want other kinds of interaction? This is where Word2Vec comes in.

Using Word2Vec to model the topics

buildModelWord2Vec builds a Word2Vec model. It intakes the training data and three additional parameters. Before we go into what these parameters are, let's spend a brief moment understanding how Word2Vec works.

The goal of Word2Vec is to convert a word into a numeric representation, namely a vector. So, for example, "dog" can be represented as [0.89, 0.6, 0.1, 0.9]. Each of these numbers could mean something, for example, [Canine, Mammal, Cat, Pet]; each number represents a degree in that particular dimension. When you encode all words as these contextual vectors, you can ask the model how close two words are, or what the result of one word plus another word is. A good graphic illustration can be found here. The number of numeric values in this vector, i.e., its size, is one of the parameters (numberOfFeatures) you need to specify. By default, we use 100, so each word turns into a vector of 100 numeric values. It's not true that the bigger this number, the better. Imagine that this number is 100,000: not only do you increase the workload to train and query the model, you also make each vector very sparse, which is bad for accurate prediction.

To train a Word2Vec model, the computer takes into account how often words are associated together. Because of this concept of "association," it needs something called "a window." For example, if our window size is 3, you can imagine the computer scanning the sentence "Softmax Data builds machine learning solutions." and reading three words at a time: "softmax data builds", "data builds machine", "builds machine learning", "machine learning solutions". This is the second parameter (windowSize). You can choose a big number for this, but if you go too big, things start to lose context. For example, if you read 1,000 words at once, then each email on its own (presuming it's around 500-600 words) becomes a single undifferentiated piece of information for the computer to detect patterns in. It just won't work. Too small, and you read only a few words at a time, which becomes very noisy.

The third parameter, minimumFrequency, is a quality control measure: it asks the model to ignore words that appear less frequently than the number you specify.

If you are still unsure what these numbers should be, use the defaults.
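With those parameters in mind, buildModelWord2Vec might be as simple as the following sketch. Only the numberOfFeatures default of 100 comes from the text above; the other defaults here are placeholders. Note that Gensim 4+ names the size parameter vector_size (older versions call it size).

```python
from gensim.models import Word2Vec

# A sketch of buildModelWord2Vec; windowSize and minimumFrequency defaults
# below are assumptions, not values from the original code.
def buildModelWord2Vec(tokenizedDocs, numberOfFeatures=100, windowSize=5,
                       minimumFrequency=2):
    return Word2Vec(
        sentences=tokenizedDocs,
        vector_size=numberOfFeatures,  # length of each word vector
        window=windowSize,             # how many words are read around each word
        min_count=minimumFrequency,    # ignore words rarer than this
    )

w2vModel = buildModelWord2Vec(tokenizedEmails)
```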

Once the model is trained, you can now start to query the emails in exciting ways by calling queryWordsWord2Vec.

There are some parameters you will need to supply, which also make it very versatile.

The first parameter is called positive, meaning the words you want to query.

For example, positive=["ecommerce", "magento"] means you want to know the words that are associated with "ecommerce" and "magento".

Conversely, negative means words you want to exclude. For example, positive=["ecommerce", "magento"], negative=["conversion"] means you want to know the words associated with "ecommerce" and "magento" but not closely associated with "conversion".

Another famous example of Word2Vec is:

positive=["king", "woman"], negative=["man"] = "queen"

or

King - Man + Woman = Queen

The final parameter, topn, means the top N similar words you want to return. If left as None, it will return all words.
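Putting it all together, queryWordsWord2Vec can be a thin wrapper around Gensim's built-in most_similar query; the wrapper name mirrors the text above, and the example results depend entirely on your own emails.

```python
# A sketch of queryWordsWord2Vec wrapping Gensim's most_similar.
def queryWordsWord2Vec(model, positive, negative=None, topn=10):
    # With topn=None, Gensim returns similarity scores against the whole vocabulary.
    return model.wv.most_similar(positive=positive, negative=negative, topn=topn)

# Example queries from the text; the words must appear in your emails often
# enough (see minimumFrequency) to be in the model's vocabulary.
print(queryWordsWord2Vec(w2vModel, positive=["ecommerce", "magento"]))
print(queryWordsWord2Vec(w2vModel, positive=["ecommerce", "magento"],
                         negative=["conversion"], topn=5))
```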

Next step?

The examples above are drastically simplified. Nevertheless, they can give you a good start in understanding what's inside your sales emails. You will need to experiment with different parameters to achieve optimal results. If you are interested in learning more, book a free meeting with our expert to talk about how we can apply machine learning to help your revenue operations at https://calendly.com/softmax/execbriefing

Happy Thursday!