
Assignment 4: Word Embeddings

Welcome to the fourth (and last) programming assignment of Course 2!

In this assignment, you will practice how to compute word embeddings and use them for sentiment analysis.

  • To implement sentiment analysis, you can go beyond counting the number of positive words and negative words.
  • You can find a way to represent each word numerically, by a vector.
  • The vector could then represent syntactic (i.e. parts of speech) and semantic (i.e. meaning) structures.

In this assignment, you will explore a classic way of generating word embeddings or representations.

  • You will implement a famous model called the continuous bag of words (CBOW) model.

By completing this assignment you will:

  • Train word vectors from scratch.
  • Learn how to create batches of data.
  • Understand how backpropagation works.
  • Plot and visualize your learned word vectors.

Knowing how to train these models will give you a better understanding of word vectors, which are building blocks to many applications in natural language processing.

  • 1 The Continuous bag of words model
  • 2 Training the Model
  • 2.0 Initialize the model
  • Exercise 01
  • 2.1 Softmax Function
  • Exercise 02
  • 2.2 Forward Propagation
  • Exercise 03
  • 2.3 Cost Function
  • 2.4 Backpropagation
  • Exercise 04
  • 2.5 Gradient Descent
  • Exercise 05
  • 3 Visualizing the word vectors

1. The Continuous bag of words model

Let's take a look at the following sentence:

'I am happy because I am learning'.
  • In continuous bag of words (CBOW) modeling, we try to predict the center word given a few context words (the words around the center word).
  • For example, if you were to choose a context half-size of, say, $C = 2$, then you would try to predict the word happy given the context that includes 2 words before and 2 words after the center word:
$C$ words before: [I, am]
$C$ words after: [because, I]
  • In other words:

$$context = [I, am, because, I] \qquad target = happy$$
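For concreteness, here is a tiny sketch of how such <context, center> pairs could be generated. The helper name get_windows is illustrative only; the assignment later provides its own get_batches helper.

```python
# A minimal sketch (not the assignment's provided helper) of how
# <context, center> pairs can be generated with a half-window of size C.
def get_windows(words, C=2):
    """Yield (context_words, center_word) tuples from a list of tokens."""
    for i in range(C, len(words) - C):
        center = words[i]
        context = words[i - C:i] + words[i + 1:i + C + 1]
        yield context, center

corpus = "I am happy because I am learning".split()
for context, center in get_windows(corpus, C=2):
    print(context, "->", center)
# ['I', 'am', 'because', 'I'] -> happy
# ['am', 'happy', 'I', 'am'] -> because
# ['happy', 'because', 'am', 'learning'] -> I
```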

The structure of your model will look like this:


Here $\bar{x}$ is the average of all the one-hot vectors of the context words.


Once you have encoded all the context words, you can use $\bar{x}$ as the input to your model.

The architecture you will be implementing is as follows:

\begin{align}
h &= W_1 \, X + b_1 \tag{1} \\
a &= ReLU(h) \tag{2} \\
z &= W_2 \, a + b_2 \tag{3} \\
\hat{y} &= softmax(z) \tag{4}
\end{align}

Assignment #4 Solutions ¶

Deep Learning / Spring 1398, Iran University of Science and Technology

Please pay attention to these notes:

  • Assignment Due: 1398/03/17 23:59
  • If you need any additional information, please review the assignment page on the course website.
  • The items you need to answer are highlighted in red and the coding parts you need to implement are denoted by: ######################################## # Put your implementation here # ########################################
  • We always recommend cooperation and discussion in groups for assignments. However, each student has to finish all the questions by themselves. If our matching system identifies any sort of copying, you'll be responsible for the consequences. So, please mention your teammate's name if you have one.
  • Students who audit this course should submit their assignments like other students to be qualified for attending the rest of the sessions.
  • Any detected copying will result in a zero for that assignment and will also be counted as two negative assignments toward your final score.
  • When you are ready to submit, please follow the instructions at the end of this notebook.
  • If you have any questions about this assignment, feel free to drop us a line. You may also post your questions on the course's forum page.
  • You must run this notebook on the Google Colab platform; some of the libraries depend on the Google Colab VM.
  • Before starting to work on the assignment, please fill in your name in the next section AND remember to RUN the cell.

Assignment Page: https://iust-deep-learning.github.io/972/assignments/04_nlp_intro

Course Forum: https://groups.google.com/forum/#!forum/dl972/

Fill your information here & run the cell

1. Word2vec ¶

In any NLP task involving neural networks, we need a numerical representation of our input (which is mainly words). A naive solution would be to use a huge one-hot vector with the same size as our vocabulary, each element representing one word. But this sparse representation is a poor use of a huge multidimensional space, as it does not contain any useful information about the meaning and semantics of a word. This is where word embeddings come in handy.

1.1 What is word embedding? ¶

Embeddings are another way of representing vocabulary in a lower-dimensional (compared to the one-hot representation) continuous space. The goal is to have similar vectors for words with similar meanings (so the elements of the vector actually carry some information about the meaning of the words). The question is, how are we going to achieve such representations? The idea is simple but elegant: words appearing in the same context are likely to have similar meanings.

So how can we use this idea to learn word vectors?

1.2 How to train? ¶

We are going to train a simple neural network with a single hidden layer to perform a certain task, but then we're not actually going to use that neural network for the task we trained it on! Instead, the goal is just to learn the weights of the hidden layer and use them as our word representation vectors.

So let's talk about this "fake" task. We're going to train the neural network to do the following: given a specific word (the input word), the network is going to tell us, for every word in our vocabulary, the probability of being near to this given word (i.e. being one of its context words). So the network is going to look something like this (considering that our vocabulary size is 10000):

By training the network on this task, words which appear in similar contexts are forced to have similar values in the hidden layer, since they have to produce similar outputs; so we can use these hidden-layer values as our word representations.

  • This approach is called skip-gram . There is another similar but slightly different approach called CBOW . Read about CBOW and explain its general idea:

$\color{red}{\text{Write your answer here}}$ </br> CBOW is very similar to skip-gram; the difference is in the task we train the model on. In skip-gram we ask the model for the context words given the center word, but in CBOW we ask the model for the center word given the context words! Skip-gram works well with a small amount of training data and represents even rare words or phrases well. On the other hand, CBOW is several times faster to train than skip-gram and has slightly better accuracy for frequent words.

1.3 A practical challenge with softmax activation ¶

Softmax is a very handy tool when it comes to probability distribution prediction problems, but it has its downsides when the number of nodes grows too large. Let's look at the softmax activation in our output layer:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{|V|} e^{z_j}}$$

As you can see, every single output depends on all the other outputs, so in order to compute the derivative with respect to any weight, all the other weights play a role! For an output size of 10000 this results in millions of mathematical operations for a single weight update, which is not practical at all!

  • There are various techniques to solve this issue, like using hierarchical softmax or NCE (Noise Contrastive Estimation) . The original Word2vec paper proposes a technique called Negative sampling . Read about this technique and explain its general idea:

$\color{red}{\text{Write your answer here}}$ </br> Recall that the desired output consisted of a few 1 values (the words in the context) and lots of 0 values (all other, irrelevant words). In other words, with each training sample we were trying to pull the embedding vectors of our target word and its context words closer together, while pushing our target embedding away from the embeddings of all irrelevant words. This is the main issue: using all irrelevant words is unnecessary and makes the softmax computation far too heavy. Negative sampling addresses this problem by selecting just a couple of irrelevant words at random (instead of all of them). The end result is that, for example, if cat appears in the context of food , then the vector of food is made more similar to the vector of cat than to the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy ), instead of all other words in the language. This makes word2vec much, much faster to train.

  • Explain why it is called Negative sampling . What are these negative samples?

$\color{red}{\text{Write your answer here}}$ </br> These randomly chosen irrelevant words are called negative samples , and they are called this way because we are trying to separate their embeddings from our target word's embedding.
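To make the objective concrete, here is a small numpy sketch of the negative-sampling loss for a single (center, context) pair. The vector names and the toy random data are illustrative assumptions, not the exact word2vec implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_center, u_context, U_negatives):
    """Loss for one (center, context) pair with k randomly sampled negative words.

    v_center:    (d,)   embedding of the center word
    u_context:   (d,)   output embedding of the true context word
    U_negatives: (k, d) output embeddings of k randomly sampled (irrelevant) words
    """
    pos = -np.log(sigmoid(u_context @ v_center))               # pull the true pair together
    neg = -np.sum(np.log(sigmoid(-U_negatives @ v_center)))    # push the negatives away
    return pos + neg

# toy example: d = 5 dimensions, k = 3 negatives drawn at random
rng = np.random.default_rng(0)
print(negative_sampling_loss(rng.normal(size=5), rng.normal(size=5), rng.normal(size=(3, 5))))
```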

1.4 Word2vec in code ¶

There is a very good library called gensim for using word2vec in Python. You can train your own word vectors on your own corpora or use available pretrained models. For example, the following model provides word vectors for a vocabulary of 3 million words and phrases, trained on roughly 100 billion words from a Google News dataset, with a vector length of 300 features:

Let's load this model in Python:
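The loading cell itself is not reproduced here; a minimal sketch with gensim would look like the following (the archive name is an assumption based on the commonly distributed Google News file):

```python
from gensim.models import KeyedVectors

# the binary archive is several gigabytes; loading keeps all 3M x 300 vectors in RAM
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True
)
print(model["cat"].shape)  # (300,)
```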

As you can see it requires a huge amount of memory!

  • Using the gensim library, find the 3 most similar words to each of the following target words using the similar_by_word method, collect the embeddings of all these words, reduce their dimension to 2 using a dimensionality reduction algorithm (e.g. t-SNE or PCA), and plot the results in a 2D scatterplot:

You can find the cosine similarity between two word vectors using similarity method:
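One possible sketch of this exercise, assuming the model loaded above; the target words used below are illustrative placeholders, not the assignment's actual word list:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

targets = ["cat", "king", "apple"]            # illustrative target words
words, vectors = [], []
for t in targets:
    neighbours = [w for w, _ in model.similar_by_word(t, topn=3)]
    for w in [t] + neighbours:
        words.append(w)
        vectors.append(model[w])

# project the 300-d embeddings down to 2-d and scatter-plot them
coords = PCA(n_components=2).fit_transform(np.array(vectors))
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()

# cosine similarity between two word vectors
print(model.similarity("logitech", "cat"))
```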

  • As you can see, there is a meaningful similarity between the word logitech (a provider of personal computer and mobile peripherals) and the word cat , even though they shouldn't be this similar. Explain why you think this happens, and find more examples of this phenomenon.

$\color{red}{\text{Write your answer here}}$ </br> This phenomenon is one of the most important current research trends in the field of word sense disambiguation. The problem occurs when two words have the same spelling but different meanings. For example, in this case the word mouse causes the problem. Since Logitech is a computer peripherals provider, it is likely to appear in the same context as the word mouse (meaning a computer I/O device). On the other hand, the words cat and mouse (meaning the animal) are likely to appear in the same context too. The result is that the embeddings of the words Logitech and cat end up close to each other because of the word mouse . More examples:

  • It seems that words like criminal and offensive are more similar to the word black than to the word white . It is claimed that the word2vec model trained on Google News suffers from gender, racial and religious biases. Explain why you think this happens and find 4 more examples:

$\color{red}{\text{Write your answer here}}$ </br> This happens because any bias in the articles that make up the word2vec corpus is inevitably captured in the geometry of the vector space. As a matter of fact, the model does not learn anything unless we teach it! This type of bias appears because the training set itself is biased (e.g. news about dark-skinned people committing crimes gets more coverage).

Word vectors have some other cool properties. For example, we know the relation between the meanings of the two words "man" and "woman" is similar to the relation between the words "king" and "queen". So we expect $e_{queen} - e_{king} = e_{woman} - e_{man}$, or $e_{queen} = e_{king} + e_{woman} - e_{man}$.

  • Show whether the above equation holds or not by following these steps:
  • Extract the embedding vectors for these words.
  • Subtract the vector of "man" from vector of "woman" and add the vector of "king"
  • Find the cosine similarity of the resulting vector with the vector for the word "queen"
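A minimal sketch of the steps above with gensim and numpy, assuming the model loaded earlier:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1) extract the embedding vectors for the four words
e_king, e_man, e_woman, e_queen = (model[w] for w in ["king", "man", "woman", "queen"])

# 2) king - man + woman
candidate = e_king + e_woman - e_man

# 3) cosine similarity of the result with the vector for "queen"
print(cosine(candidate, e_queen))   # fairly high, but not exactly 1.0

# gensim can run the same search over the whole vocabulary:
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```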

2. Context representation using a window-based neural network ¶

From the previous section, we saw that word vectors can store a lot of semantic information in themselves. But can we solve an NLP task by just feeding them through a simple neural network? Assume we want to find all named entities in a given sentence (aka Named Entity Recognition). For example, in "I bought 300 shares of Apple Corp. in the last year", we want to locate the word "Apple" and categorize it as an Organization entity.

Obviously, a neural network cannot guess the type entirely based on a single word. We need to provide an extra piece of information to help the decision. This piece of information is called "Context" . We can decide if the word Apple is referring to the company or fruit by seeing it in a sentence (context). However, feeding a complete sentence through a network is inefficient as it makes the input layer really big even for a 10-word sentence (10 * 300 = 3000, assuming an embedding size of 300).

To make training such a network possible, we build the input from only the K surrounding neighbor words. Hence, apple can easily be classified as a company by looking at the context window [ the, apple, corporation ].


In a window-based classifier, every input sentence $X = [\mathbf{x^{(1)}}, ... , \mathbf{x^{(T)}}]$ with a label sequence $Y = [\mathbf{y^{(1)}}, ..., \mathbf{y^{(T)}}]$ is split into $T$ <context window, center word label> data points. We create a context window $\mathbf{w^{(t)}}$ for every token $\mathbf{x^{(t)}}$ in the original sentence by concatenating its k surrounding neighbors: $\mathbf{w^{(t)}} = [\mathbf{x^{(t-k)}}; ...; \mathbf{x^{(t)}}; ...; \mathbf{x^{(t+k)}}]$, therefore our new data point is created as $\langle \mathbf{w^{(t)}} , \mathbf{y^{(t)}} \rangle$.
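A small sketch of the window construction described above; the padding token and the function name are arbitrary illustrative choices:

```python
def context_windows(tokens, k=2, pad="<PAD>"):
    """Build the window w_t = [x_{t-k}; ...; x_t; ...; x_{t+k}] for every position t,
    padding at the sentence boundaries so every token gets a full window."""
    padded = [pad] * k + tokens + [pad] * k
    return [padded[i:i + 2 * k + 1] for i in range(len(tokens))]

print(context_windows(["I", "bought", "shares", "of", "Apple"], k=2)[4])
# ['shares', 'of', 'Apple', '<PAD>', '<PAD>']
```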

Having word case information might also help the neural network to find name entities with higher confidence. To incorporate casing, every token $\mathbf{x^{(t)}}$ is augmented with feature vector $\mathbf{c}$ representing such information: $\mathbf{x^{(t)}} = [\mathbf{e^{(t)}};\mathbf{c^{(t)}}]$ where $\mathbf{e^{(t)}}$ is the corresponding embedding.

In this section, we aim to build a window based feedforward neural network on the NER task, and then analyze its limitations through a case study.

Let's import some dependencies.

And define the model's hyperparameters:

2.1 Preprocessing ¶

As discussed earlier, we want to include the word casing information. Here's our desired function to encode the casing detail in a d-dimensional vector. The words "Hello", "hello", "HELLO" and "hELLO" have four different casings. Your encoding should support all of them; in other words, the implemented function must return 4 different vectors for these inputs, but the same output for "Bye" and "Hello", "bye" and "hello", "bYe" and "hEllo", etc.
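One possible design (not the notebook's exact helper) is a 4-dimensional one-hot casing vector:

```python
import numpy as np

def casing_vector(word):
    """Encode casing as a one-hot 4-d vector:
    [all lowercase, all uppercase, title case, mixed case]."""
    if word.islower():
        kind = 0
    elif word.isupper():
        kind = 1
    elif word.istitle():
        kind = 2
    else:
        kind = 3
    return np.eye(4)[kind]

for w in ["hello", "HELLO", "Hello", "hELLO"]:
    print(w, casing_vector(w))   # four different vectors, as required
```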

Describe two other features that would help the window-based model to perform better (apart from word casing).

$\color{red}{\text{Write your answer here}}$

  • POS Tags (e.g. Verb, Adj)
  • NER Tag of previous token

CoNLL 2003[1] is a classic NER dataset; it has five tags per word: [PER, ORG, LOC, MISC, O] , where the label O is for words that are not part of any named entity. We use this dataset to train our window-based model. Note that our split is different from the original one.

Download and construct the pre-trained embedding matrix using GloVe word vectors.

2.2 Implementation ¶

Let's build the model. We recommend the Keras functional API. The number of layers as well as their dimensions is totally up to you.
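A hedged sketch of what such a window-based classifier could look like with the Keras functional API; every hyperparameter and layer size below is a placeholder, not the assignment's reference solution:

```python
from tensorflow.keras import layers, models

WINDOW = 2 * 2 + 1        # k = 2 neighbours on each side (placeholder)
EMB_DIM = 300             # GloVe dimension (placeholder)
CASE_DIM = 4              # casing feature size (placeholder)
NUM_CLASSES = 5           # PER, ORG, LOC, MISC, O

# one row per window position, each row = [word embedding ; casing features]
inp = layers.Input(shape=(WINDOW, EMB_DIM + CASE_DIM))
x = layers.Flatten()(inp)                        # concatenate the window into one vector
x = layers.Dense(256, activation="relu")(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

ner_model = models.Model(inp, out)
ner_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
ner_model.summary()
```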

2.3 Training ¶

2.4 Analysis ¶

Now it's time to analyze the model's behavior. Here is an interactive shell that will enable us to explore the model's limitations and capabilities. Note that sentences should be entered with spaces between tokens, and use "do n't" instead of "don't".

To further understand and analyze mistakes made by the model, let's see the confusion matrix:

Describe the window-based network modeling limitations by exploring its outputs. You need to support your conclusion by showing us the errors your model makes. You can either use validation set samples or a manually entered sentence to force the model to make an error. Remember to copy and paste input/output from the interactive shell here.

The model knows nothing about the tag predicted for the previous neighboring word. Thus it is unable to correctly guess the labels of multi-word named entities.

The model cannot look at other parts of the sentence.

  • The model cannot look at the future context. x : New York State University y': LOC ORG ORG ORG

3. BOW Sentence Representation ¶

We have shown that arithmetic relations are present in the embedding space, for example $e_{queen} = e_{king} + e_{woman} - e_{man}$. But are they strong enough for building a rich representation of a sentence? Can we classify a sentence according to the mean of its words' embeddings? In this section, we will find the answers to these questions.

Assume sentence $X = [\mathbf{x^{(1)}}, ..., \mathbf{x^{(N)}}]$ is given; then a sentence representation $\mathbf{R}$ can be calculated as follows:

$$\mathbf{R} = \frac{1}{N} \sum_{i=1}^{N} e_{x^{(i)}}$$

where $e_{x^{(i)}}$ is the embedding vector for the token $x^{(i)}$.


Having such a simple model will enable us to analyze and understand its capabilities more easily. In addition, we will try one of the state-of-the-art text processing tools, called Flair, which can be run on GPUs. The task is text classification on the AG News corpus, which consists of news articles from more than 2000 news sources. Our split has 110K samples for the training and 10k for the validation set. Dataset examples are labeled with 4 major labels: {World, Sports, Business, Sci/Tech}

3.1 Preprocessing ¶

Often, datasets in NLP come with unprocessed sentences. As a deep learning expert, you should be familiar with popular text processing tools such as NLTK, Spacy, Stanford CoreNLP, and Flair. Generally, text pre-processing in deep learning includes tokenization, vocabulary creation, and padding. But here we want to do one more step, NER replacement. Basically, we want to replace named entities with their corresponding tags. For example, "George Washington went to New York" will be converted to "<PER> went to <LOC>".

The purpose of this step is to reduce the size of the vocabulary and support more words. This strategy proves most beneficial when the dataset contains a large number of named entities, e.g. a news dataset.

Most pre-processing parts are implemented for you. You only need to fill in the following function. Be sure to read the Flair documentation first.
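A possible sketch of the NER-replacement step with Flair. The tagger name and label accessors reflect recent Flair releases and are assumptions; the exact API may differ in the version pinned by the notebook.

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Flair's pre-trained 4-class English NER tagger (PER, ORG, LOC, MISC)
tagger = SequenceTagger.load("ner")

def replace_named_entities(text):
    """'George Washington went to New York' -> '<PER> went to <LOC>'"""
    sentence = Sentence(text)
    tagger.predict(sentence)
    for span in sentence.get_spans("ner"):
        tag = span.tag  # on newer Flair versions: span.get_label("ner").value
        text = text.replace(span.text, "<%s>" % tag)
    return text

print(replace_named_entities("George Washington went to New York"))
```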

Test your implementation:

Define model's hyperparameters

Process the entire corpus. It will take approximately 50 minutes, so please be patient. You may want to move on to the next sections in the meantime.

Create the embedding matrix

3.2 Implementation ¶

Let's build the model. As always, the Keras functional API is recommended. The number of layers as well as their dimensionality is totally up to you.
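A hedged sketch of such a BOW classifier (embedding average followed by a dense classifier); all sizes below are placeholders, not the reference solution:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000        # placeholder hyperparameters
EMB_DIM = 100
MAX_LEN = 60
NUM_CLASSES = 4           # World, Sports, Business, Sci/Tech

tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
emb = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(tokens)   # optionally seeded with GloVe
avg = layers.GlobalAveragePooling1D()(emb)                            # the BOW average R
x = layers.Dense(64, activation="relu")(avg)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

bow_model = models.Model(tokens, out)
bow_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```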

3.3 Training ¶

3.4 Analysis ¶

As in the previous section, an interactive shell is provided. You can enter an input sequence to get the predicted label. The preprocessing functions will do the tokenization, so don't worry about the spacing.

It is always helpful to see the confusion matrix:

Obviously, this is a relatively simple model, so it has limited modeling capabilities. Now it's time to find its mistakes. Can you fool the model by feeding it a toxic example? Can you see the bag-of-words effect in its behavior? Write down the model's limitations and your answers to the above questions, and keep in mind that you need to support each of your claims with an input/output example.

Here are some findings from our students:

Below we see the effect of BOW: it seems that the correct label is Business, but by ignoring the relations between words and their order, the model made a mistake.

Credits: Mohammad hasan Shamgholi

4. RNN Intuition ¶

Up to now, we've investigated window-based neural networks and the bag-of-words model. Given their simple architectures, the representation power of these models mainly relies on the pre-trained embeddings. For example, a window-based model cannot see the previous token's label, which makes it struggle to identify multi-word entities. And while adding a single word " not " can entirely change the meaning of a sentence, the BoW model is not sensitive to this, as it ignores word order and computes the average embedding (in which single words do not play big roles).

In contrast, RNNs read sentences word by word. At each step, the softmax classifier is forced to predict the label not only from the input word but also from its context information. If we see this context information as a working memory for the RNN, it will be interesting to find out what kind of information is stored in it while the network parses a sentence.

To visualize an RNN's memory, we will train a language model on a huge chunk of text, and use the validation set to analyze its brain. Then, we will watch each context neuron's activation to see if it shows a meaningful pattern while the model goes through a sentence. The following figure illustrates a random neuron in the memory which captures the concept of line length: it gradually turns off as it approaches the end of the line. Our model probably uses this neuron to handle "\n" generation.


Here is another neuron which is sensitive when it's inside a quote.


Here, our goal is to find other meaningful patterns in the RNN hidden states. There is an open source library called LSTMVis which provides pre-trained models and a great visualization tool. First, watch its tutorial and then answer the following questions:

For each model, find at least two meaningful patterns, and support your hypothesis with screenshots of LSTMVis.

1- Character Model (Wall Street Journal)

Here are some patterns found by our students.

  • A neuron which activates on spaces (Credits: Mohsen Tabasi)


This one is activated after seeing "a" and deactivates after reading "of"! (Credits: Mohsen Tabasi)

A combination of neurons which activate on plural nouns ending in "s"


2- Word Model (Wall Street Journal)

A pattern when the model is referring to some kind of proportion


A set of neurons which get triggered on pronouns


3- Can you spot the difference between a character-based and a word-based language model?

Character-based models have to learn the concept of a word in the first place, and only then can they go for much more complex patterns such as gender and grammar. However, given that character-based models parse the sentence one character at a time, they can find patterns within words, i.e., they can identify frequent character n-grams, which can help them guess the meaning of unknown words.

References ¶

  • Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003.
  • Zhang, Zhao, and LeCun, “Character-Level Convolutional Networks for Text Classification.”
  • Stanford CS224d Course
  • “The Unreasonable Effectiveness of Recurrent Neural Networks.” Accessed May 26, 2019. http://karpathy.github.io/2015/05/21/rnn-effectiveness/ .


Word Embeddings: Encoding Lexical Semantics ¶

Word embeddings are dense vectors of real numbers, one per word in your vocabulary. In NLP, it is almost always the case that your features are words! But how should you represent a word in a computer? You could store its ascii character representation, but that only tells you what the word is , it doesn’t say much about what it means (you might be able to derive its part of speech from its affixes, or properties from its capitalization, but not much). Even more, in what sense could you combine these representations? We often want dense outputs from our neural networks, where the inputs are \(|V|\) dimensional, where \(V\) is our vocabulary, but often the outputs are only a few dimensional (if we are only predicting a handful of labels, for instance). How do we get from a massive dimensional space to a smaller dimensional space?

How about instead of ascii representations, we use a one-hot encoding? That is, we represent the word \(w\) by

\[\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}\]

where the 1 is in a location unique to \(w\) . Any other word will have a 1 in some other location, and a 0 everywhere else.

There is an enormous drawback to this representation, besides just how huge it is. It basically treats all words as independent entities with no relation to each other. What we really want is some notion of similarity between words. Why? Let’s see an example.

Suppose we are building a language model. Suppose we have seen the sentences

The mathematician ran to the store.

The physicist ran to the store.

The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before seen in our training data:

The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn’t it be much better if we could use the following two facts:

We have seen mathematician and physicist in the same role in a sentence. Somehow they have a semantic relation.

We have seen mathematician in the same role in this new unseen sentence as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen sentence? This is what we mean by a notion of similarity: we mean semantic similarity , not simply having similar orthographic representations. It is a technique to combat the sparsity of linguistic data, by connecting the dots between what we have seen and what we haven’t. This example of course relies on a fundamental linguistic assumption: that words appearing in similar contexts are related to each other semantically. This is called the distributional hypothesis .

Getting Dense Word Embeddings ¶

How can we solve this problem? That is, how could we actually encode semantic similarity in words? Maybe we think up some semantic attributes. For example, we see that both mathematicians and physicists can run, so maybe we give these words a high score for the “is able to run” semantic attribute. Think of some other attributes, and imagine what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector, like this:

Then we can get a measure of similarity between these words by doing:

\[\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician}\]

Although it is more common to normalize by the lengths:

\[\text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}}{\lVert q_\text{physicist} \rVert \, \lVert q_\text{mathematician} \rVert} = \cos(\phi)\]

Where \(\phi\) is the angle between the two vectors. That way, extremely similar words (words whose embeddings point in the same direction) will have similarity 1. Extremely dissimilar words should have similarity -1.

You can think of the sparse one-hot vectors from the beginning of this section as a special case of these new vectors we have defined, where each word basically has similarity 0, and we gave each word some unique semantic attribute. These new vectors are dense , which is to say their entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of different semantic attributes that might be relevant to determining similarity, and how on earth would you set the values of the different attributes? Central to the idea of deep learning is that the neural network learns representations of the features, rather than requiring the programmer to design them herself. So why not just let the word embeddings be parameters in our model, and then be updated during training? This is exactly what we will do. We will have some latent semantic attributes that the network can, in principle, learn. Note that the word embeddings will probably not be interpretable. That is, although with our hand-crafted vectors above we can see that mathematicians and physicists are similar in that they both like coffee, if we allow a neural network to learn the embeddings and see that both mathematicians and physicists have a large value in the second dimension, it is not clear what that means. They are similar in some latent semantic dimension, but this probably has no interpretation to us.

In summary, word embeddings are a representation of the *semantics* of a word, efficiently encoding semantic information that might be relevant to the task at hand . You can embed other things too: part of speech tags, parse trees, anything! The idea of feature embeddings is central to the field.

Word Embeddings in Pytorch ¶

Before we get to a worked example and an exercise, a few quick notes about how to use embeddings in Pytorch and in deep learning programming in general. Similar to how we defined a unique index for each word when making one-hot vectors, we also need to define an index for each word when using embeddings. These will be keys into a lookup table. That is, embeddings are stored as a \(|V| \times D\) matrix, where \(D\) is the dimensionality of the embeddings, such that the word assigned index \(i\) has its embedding stored in the \(i\) ’th row of the matrix. In all of my code, the mapping from words to indices is a dictionary named word_to_ix.

The module that allows you to use embeddings is torch.nn.Embedding, which takes two arguments: the vocabulary size, and the dimensionality of the embeddings.

To index into this table, you must use torch.LongTensor (since the indices are integers, not floats).
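The tutorial's code cell is not reproduced in this extract; a minimal equivalent example looks like this:

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in the vocab, 5-dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)  # a 1 x 5 tensor of (randomly initialized, trainable) parameters
```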

An Example: N-Gram Language Modeling ¶

Recall that in an n-gram language model, given a sequence of words \(w\) , we want to compute

\[P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1})\]

Where \(w_i\) is the ith word of the sequence.

In this example, we will compute the loss function on some training examples and update the parameters with backpropagation.
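The training cell is omitted in this extract; a condensed version (using a short example sentence in place of the tutorial's full sonnet) is sketched below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

CONTEXT_SIZE, EMBEDDING_DIM = 2, 10
test_sentence = "When forty winters shall besiege thy brow".split()
# build (context, target) trigrams: ([w_{i-2}, w_{i-1}], w_i)
ngrams = [([test_sentence[i - 2], test_sentence[i - 1]], test_sentence[i])
          for i in range(2, len(test_sentence))]
word_to_ix = {w: i for i, w in enumerate(set(test_sentence))}

class NGramLanguageModeler(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))   # concatenate context embeddings
        out = F.relu(self.linear1(embeds))
        return F.log_softmax(self.linear2(out), dim=1)

model = NGramLanguageModeler(len(word_to_ix), EMBEDDING_DIM, CONTEXT_SIZE)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in ngrams:
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
        model.zero_grad()
        log_probs = model(context_idxs)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(total_loss)   # the loss decreases over epochs
```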

Exercise: Computing Word Embeddings: Continuous Bag-of-Words ¶

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict words given the context of a few words before and a few words after the target word. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic. Typically, CBOW is used to quickly train word embeddings, and these embeddings are used to initialize the embeddings of some more complicated model. Usually, this is referred to as pretraining embeddings . It almost always helps performance a couple of percent.

The CBOW model is as follows. Given a target word \(w_i\) and an \(N\) context window on each side, \(w_{i-1}, \dots, w_{i-N}\) and \(w_{i+1}, \dots, w_{i+N}\) , referring to all context words collectively as \(C\) , CBOW tries to minimize

\[-\log p(w_i | C) = -\log \text{Softmax}\left(A \left( \sum_{w \in C} q_w \right) + b \right)\]

where \(q_w\) is the embedding for word \(w\) .

Implement this model in Pytorch by filling in the class below. Some tips:

Think about which parameters you need to define.

Make sure you know what shape each operation expects. Use .view() if you need to reshape.
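Since the exercise's class skeleton is not included in this extract, here is one possible completion under the formulation above; it is a sketch, not the official solution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    """Predict the target word from the sum of its context embeddings."""
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)   # plays the role of A and b

    def forward(self, inputs):                    # inputs: LongTensor of context word indices
        embeds = self.embeddings(inputs)          # (2N, embedding_dim)
        summed = embeds.sum(dim=0).view(1, -1)    # sum_{w in C} q_w
        return F.log_softmax(self.linear(summed), dim=1)

def make_context_vector(context, word_to_ix):
    return torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
```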


2. Training the Model
Mapping words to indices and indices to words

We provide a helper function to create a dictionary that maps words to indices and indices to words.

Initializing the model

You will now initialize two matrices and two vectors.

The first matrix ($W_1$) is of dimension $N \times V$, where $V$ is the number of words in your vocabulary and $N$ is the dimension of your word vector.

The second matrix ($W_2$) is of dimension $V \times N$.

Vector $b_1$ has dimensions $N \times 1$.

Vector $b_2$ has dimensions $V \times 1$.

$b_1$ and $b_2$ are the bias vectors of the linear layers from matrices $W_1$ and $W_2$.

The overall structure of the model will look as in Figure 1, but at this stage we are just initializing the parameters.

Please use numpy.random.rand to generate matrices that are initialized with random values from a uniform distribution, ranging between 0 and 1.

Note: In the next cell you will encounter a random seed. Please DO NOT modify this seed so your solution can be tested correctly.
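The implementation cell is not reproduced here; a sketch consistent with the shapes described above (the function signature is an assumption, and the grader's expected interface may differ) is:

```python
import numpy as np

def initialize_model(N, V, random_seed=1):
    """Initialize W1, W2, b1, b2 with uniform random values in [0, 1)."""
    np.random.seed(random_seed)
    W1 = np.random.rand(N, V)
    W2 = np.random.rand(V, N)
    b1 = np.random.rand(N, 1)
    b2 = np.random.rand(V, 1)
    return W1, W2, b1, b2

W1, W2, b1, b2 = initialize_model(N=4, V=10, random_seed=1)
print(W1.shape, W2.shape, b1.shape, b2.shape)  # (4, 10) (10, 4) (4, 1) (10, 1)
```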

Expected Output

2.1 Softmax Function

Before we can start training the model, we need to implement the softmax function as defined in equation (5):

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=0}^{V-1} e^{z_j}} \tag{5}$$

Array indexing in code starts at 0.

$V$ is the number of words in the vocabulary (which is also the number of rows of $z$).

$i$ goes from 0 to $|V| - 1$.

Instructions : Implement the softmax function below.

Assume that the input $z$ to softmax is a 2D array.

Each training example is represented by a column of shape (V, 1) in this 2D array.

There may be more than one column in the 2D array, because you can put in a batch of examples to increase efficiency. Let's call the batch size lowercase $m$, so the $z$ array has shape (V, m).

When taking the sum from $i = 1 \cdots V-1$, take the sum for each column (each example) separately.

numpy.sum (set the axis so that you take the sum of each column in z)
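A minimal numpy sketch of such a column-wise softmax (the max-shift for numerical stability is an extra touch, not something the assignment requires, and it does not change the result):

```python
import numpy as np

def softmax(z):
    """Column-wise softmax for z of shape (V, m): one column per example."""
    e_z = np.exp(z - np.max(z, axis=0, keepdims=True))  # shift for numerical stability
    return e_z / np.sum(e_z, axis=0, keepdims=True)

z = np.array([[9.0, 1.0], [8.0, 2.0], [11.0, 3.0], [8.0, 4.0], [8.5, 5.0]])
print(softmax(z).sum(axis=0))  # each column sums to 1
```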

Expected Output

2.2 Forward Propagation

Implement the forward propagation $z$ according to equations (1) to (3).

For that, you will use as activation the Rectified Linear Unit (ReLU), given by:

$$f(h) = \max(0, h)$$

  • You can use numpy.maximum(x1,x2) to get the maximum of two values
  • Use numpy.dot(A,B) to matrix multiply A and B
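A sketch of the forward pass following equations (1) to (3); the exact function signature expected by the grader may differ:

```python
import numpy as np

def relu(h):
    return np.maximum(0, h)

def forward_prop(x, W1, W2, b1, b2):
    """x has shape (V, m): a batch of averaged one-hot context vectors.
    Returns the logits z and the hidden activation h."""
    h = relu(np.dot(W1, x) + b1)   # equations (1) and (2)
    z = np.dot(W2, h) + b2         # equation (3)
    return z, h
```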

Expected output

2.3 Cost Function

We have implemented the cross-entropy cost function for you.
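The provided cell is not shown in this extract; a sketch of a batch cross-entropy cost of the usual form $J = -\frac{1}{m}\sum y \log \hat{y}$, continuing from the numpy sketches above, would be:

```python
def compute_cost(y, yhat, batch_size):
    """Cross-entropy cost averaged over the batch; y and yhat have shape (V, m)."""
    logprobs = np.multiply(np.log(yhat), y)
    cost = -1 / batch_size * np.sum(logprobs)
    return np.squeeze(cost)
```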

2.4 Training the Model - Backpropagation

Now that you have understood how the CBOW model works, you will train it. You created a function for the forward propagation. Now you will implement a function that computes the gradients to backpropagate the errors.
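One way to derive the gradients from equations (1)-(4) is sketched below; this is a hand-derived sketch under the stated architecture, not the official solution, and the assignment's own instructions may phrase the ReLU step differently:

```python
def back_prop(x, yhat, y, h, W1, W2, b1, b2, batch_size):
    """Gradients of the cross-entropy cost w.r.t. W1, W2, b1 and b2."""
    l1 = np.dot(W2.T, yhat - y)        # backprop the error through the second linear layer
    l1[h <= 0] = 0                     # gradient of ReLU: zero where the activation was clipped
    grad_W1 = (1 / batch_size) * np.dot(l1, x.T)
    grad_W2 = (1 / batch_size) * np.dot(yhat - y, h.T)
    grad_b1 = (1 / batch_size) * np.sum(l1, axis=1, keepdims=True)
    grad_b2 = (1 / batch_size) * np.sum(yhat - y, axis=1, keepdims=True)
    return grad_W1, grad_W2, grad_b1, grad_b2
```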

Gradient Descent

Now that you have implemented a function to compute the gradients, you will implement batch gradient descent over your training set.

Hint: For that, you will use initialize_model and the back_prop functions which you just created (and the compute_cost function). You can also use the provided get_batches helper function:

for x, y in get_batches(data, word2Ind, V, C, batch_size):

Also: print the cost after each batch is processed (use batch size = 128)

Your numbers may differ a bit depending on which version of Python you're using.
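A sketch of the training loop, assuming the helpers named in the hint above (get_batches, initialize_model, forward_prop, softmax, compute_cost, back_prop); the hyperparameter defaults are illustrative:

```python
def gradient_descent(data, word2Ind, N, V, num_iters, alpha=0.03, C=2, batch_size=128):
    """Batch gradient descent over the CBOW model."""
    W1, W2, b1, b2 = initialize_model(N, V)
    iters = 0
    for x, y in get_batches(data, word2Ind, V, C, batch_size):
        z, h = forward_prop(x, W1, W2, b1, b2)
        yhat = softmax(z)
        cost = compute_cost(y, yhat, batch_size)
        print(f"iteration {iters + 1}, cost {cost:.4f}")
        grad_W1, grad_W2, grad_b1, grad_b2 = back_prop(x, yhat, y, h, W1, W2, b1, b2, batch_size)
        # gradient descent update
        W1 -= alpha * grad_W1
        W2 -= alpha * grad_W2
        b1 -= alpha * grad_b1
        b2 -= alpha * grad_b2
        iters += 1
        if iters == num_iters:
            break
    return W1, W2, b1, b2
```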

3.0 Visualizing the word vectors

In this part you will visualize the word vectors trained using the function you just coded above.

You can see that man and king are next to each other. However, we have to be careful with the interpretation of these projected word vectors, since the picture depends on the projection chosen by PCA -- as shown in the following illustration.


Word embeddings

This tutorial contains an introduction to word embeddings. You will train your own word embeddings using a simple Keras model for a sentiment classification task, and then visualize them in the Embedding Projector (shown in the image below).

Screenshot of the embedding projector

Representing text as numbers

Machine learning models take vectors (arrays of numbers) as input. When working with text, the first thing you must do is come up with a strategy to convert strings to numbers (or to "vectorize" the text) before feeding it to the model. In this section, you will look at three strategies for doing so.

One-hot encodings

As a first idea, you might "one-hot" encode each word in your vocabulary. Consider the sentence "The cat sat on the mat". The vocabulary (or unique words) in this sentence is (cat, mat, on, sat, the). To represent each word, you will create a zero vector with length equal to the vocabulary, then place a one in the index that corresponds to the word. This approach is shown in the following diagram.

Diagram of one-hot encodings

To create a vector that contains the encoding of the sentence, you could then concatenate the one-hot vectors for each word.

Encode each word with a unique number

A second approach you might try is to encode each word using a unique number. Continuing the example above, you could assign 1 to "cat", 2 to "mat", and so on. You could then encode the sentence "The cat sat on the mat" as a dense vector like [5, 1, 4, 3, 5, 2]. This approach is efficient. Instead of a sparse vector, you now have a dense one (where all elements are full).

There are two downsides to this approach, however:

The integer-encoding is arbitrary (it does not capture any relationship between words).

An integer-encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.

Word embeddings give us a way to use an efficient, dense representation in which similar words have a similar encoding. Importantly, you do not have to specify this encoding by hand. An embedding is a dense vector of floating point values (the length of the vector is a parameter you specify). Instead of specifying the values for the embedding manually, they are trainable parameters (weights learned by the model during training, in the same way a model learns weights for a dense layer). It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensions when working with large datasets. A higher dimensional embedding can capture fine-grained relationships between words, but takes more data to learn.

Diagram of an embedding

Above is a diagram for a word embedding. Each word is represented as a 4-dimensional vector of floating point values. Another way to think of an embedding is as "lookup table". After these weights have been learned, you can encode each word by looking up the dense vector it corresponds to in the table.

Download the IMDb Dataset

You will use the Large Movie Review Dataset through the tutorial. You will train a sentiment classifier model on this dataset and in the process learn embeddings from scratch. To read more about loading a dataset from scratch, see the Loading text tutorial .

Download the dataset using Keras file utility and take a look at the directories.

Take a look at the train/ directory. It has pos and neg folders with movie reviews labelled as positive and negative respectively. You will use reviews from pos and neg folders to train a binary classification model.

The train directory also has additional folders which should be removed before creating training dataset.

Next, create a tf.data.Dataset using tf.keras.utils.text_dataset_from_directory . You can read more about using this utility in this text classification tutorial .

Use the train directory to create both train and validation datasets with a split of 20% for validation.

Take a look at a few movie reviews and their labels (1: positive, 0: negative) from the train dataset.

Configure the dataset for performance

These are two important methods you should use when loading data to make sure that I/O does not become blocking.

.cache() keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.

.prefetch() overlaps data preprocessing and model execution while training.

You can learn more about both methods, as well as how to cache data to disk in the data performance guide .
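A minimal sketch of applying both methods, assuming train_ds and val_ds were built in the (omitted) dataset-loading cells above:

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# cache the decoded examples in memory and overlap preprocessing with training
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
```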

Using the Embedding layer

Keras makes it easy to use word embeddings. Take a look at the Embedding layer.

The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.

When you create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarities between words (as they were learned for the specific problem your model is trained on).

If you pass an integer to an embedding layer, the result replaces each integer with the vector from the embedding table:

For text or sequence problems, the Embedding layer takes a 2D tensor of integers, of shape (samples, sequence_length) , where each entry is a sequence of integers. It can embed sequences of variable lengths. You could feed into the embedding layer above batches with shapes (32, 10) (batch of 32 sequences of length 10) or (64, 15) (batch of 64 sequences of length 15).

The returned tensor has one more axis than the input, the embedding vectors are aligned along the new last axis. Pass it a (2, 3) input batch and the output is (2, 3, N)
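The tutorial's code cells are omitted in this extract; a minimal equivalent demonstration of these shapes is:

```python
import tensorflow as tf

# Embed a vocabulary of 1,000 words into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)

# A single integer per position is replaced by its 5-d vector.
result = embedding_layer(tf.constant([1, 2, 3]))
print(result.shape)          # (3, 5)

# A (2, 3) batch of sequences comes back with one extra axis: (2, 3, 5).
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
print(result.shape)          # (2, 3, 5)
```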

When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of shape (samples, sequence_length, embedding_dimensionality) . To convert from this sequence of variable length to a fixed representation there are a variety of standard approaches. You could use an RNN, Attention, or pooling layer before passing it to a Dense layer. This tutorial uses pooling because it's the simplest. The Text Classification with an RNN tutorial is a good next step.

Text preprocessing

Next, define the dataset preprocessing steps required for your sentiment classification model. Initialize a TextVectorization layer with the desired parameters to vectorize movie reviews. You can learn more about using this layer in the Text Classification tutorial.

Create a classification model

Use the Keras Sequential API to define the sentiment classification model. In this case it is a "Continuous bag of words" style model.

  • The TextVectorization layer transforms strings into vocabulary indices. You have already initialized vectorize_layer as a TextVectorization layer and built its vocabulary by calling adapt on text_ds . Now vectorize_layer can be used as the first layer of your end-to-end classification model, feeding transformed strings into the Embedding layer.

The Embedding layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding) .

The GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.

The fixed-length output vector is piped through a fully-connected ( Dense ) layer with 16 hidden units.

The last layer is densely connected with a single output node.
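Putting the layers described above together gives a model along these lines; vectorize_layer and vocab_size come from the (omitted) preprocessing cells, so treat this as a sketch rather than the exact tutorial cell:

```python
import tensorflow as tf

embedding_dim = 16

model = tf.keras.Sequential([
    vectorize_layer,                                              # strings -> token indices
    tf.keras.layers.Embedding(vocab_size, embedding_dim, name="embedding"),
    tf.keras.layers.GlobalAveragePooling1D(),                     # average over the sequence
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1)                                      # single logit for binary sentiment
])
```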

Compile and train the model

You will use TensorBoard to visualize metrics including loss and accuracy. Create a tf.keras.callbacks.TensorBoard .

Compile and train the model using the Adam optimizer and BinaryCrossentropy loss.
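A sketch of the compile-and-fit step, assuming the model defined above and the train_ds/val_ds datasets from earlier:

```python
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=[tensorboard_callback],
)
```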

With this approach the model reaches a validation accuracy of around 78% (note that the model is overfitting since training accuracy is higher).

You can look into the model summary to learn more about each layer of the model.

Visualize the model metrics in TensorBoard.

[TensorBoard screenshot: embeddings_classifier_accuracy.png]

Retrieve the trained word embeddings and save them to disk

Next, retrieve the word embeddings learned during training. The embeddings are weights of the Embedding layer in the model. The weights matrix is of shape (vocab_size, embedding_dimension) .

Obtain the weights from the model using get_layer() and get_weights() . The get_vocabulary() function provides the vocabulary to build a metadata file with one token per line.

Write the weights to disk. To use the Embedding Projector , you will upload two files in tab separated format: a file of vectors (containing the embedding), and a file of meta data (containing the words).
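A sketch of writing the two files, assuming the trained model and vectorize_layer from above (the layer name "embedding" matches the name given in the model sketch):

```python
import io

weights = model.get_layer("embedding").get_weights()[0]   # (vocab_size, embedding_dim)
vocab = vectorize_layer.get_vocabulary()

out_v = io.open("vecs.tsv", "w", encoding="utf-8")
out_m = io.open("meta.tsv", "w", encoding="utf-8")
for index, word in enumerate(vocab):
    if index == 0:
        continue  # skip the padding token
    vec = weights[index]
    out_v.write("\t".join(str(x) for x in vec) + "\n")
    out_m.write(word + "\n")
out_v.close()
out_m.close()
```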

If you are running this tutorial in Colaboratory , you can use the following snippet to download these files to your local machine (or use the file browser, View -> Table of contents -> File browser ).

Visualize the embeddings

To visualize the embeddings, upload them to the embedding projector.

Open the Embedding Projector (this can also run in a local TensorBoard instance).

Click on "Load data".

Upload the two files you created above: vecs.tsv and meta.tsv .

The embeddings you have trained will now be displayed. You can search for words to find their closest neighbors. For example, try searching for "beautiful". You may see neighbors like "wonderful".

This tutorial has shown you how to train and visualize word embeddings from scratch on a small dataset.

To train word embeddings using Word2Vec algorithm, try the Word2Vec tutorial.

To learn more about advanced text processing, read the Transformer model for language understanding .


  • Get Started
  • Tutorials >
  • Deep Learning for NLP with Pytorch >
  • Word Embeddings: Encoding Lexical Semantics

Click here to download the full example code

Word Embeddings: Encoding Lexical Semantics ¶

Word embeddings are dense vectors of real numbers, one per word in your vocabulary. In NLP, it is almost always the case that your features are words! But how should you represent a word in a computer? You could store its ascii character representation, but that only tells you what the word is , it doesn’t say much about what it means (you might be able to derive its part of speech from its affixes, or properties from its capitalization, but not much). Even more, in what sense could you combine these representations? We often want dense outputs from our neural networks, where the inputs are \(|V|\) dimensional, where \(V\) is our vocabulary, but often the outputs are only a few dimensional (if we are only predicting a handful of labels, for instance). How do we get from a massive dimensional space to a smaller dimensional space?

How about instead of ascii representations, we use a one-hot encoding? That is, we represent the word \(w\) by

where the 1 is in a location unique to \(w\) . Any other word will have a 1 in some other location, and a 0 everywhere else.

There is an enormous drawback to this representation, besides just how huge it is. It basically treats all words as independent entities with no relation to each other. What we really want is some notion of similarity between words. Why? Let’s see an example.

Suppose we are building a language model. Suppose we have seen the sentences

  • The mathematician ran to the store.
  • The physicist ran to the store.
  • The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before seen in our training data:

  • The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn’t it be much better if we could use the following two facts:

  • We have seen mathematician and physicist in the same role in a sentence. Somehow they have a semantic relation.
  • We have seen mathematician in the same role in this new unseen sentence as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen sentence? This is what we mean by a notion of similarity: we mean semantic similarity , not simply having similar orthographic representations. It is a technique to combat the sparsity of linguistic data, by connecting the dots between what we have seen and what we haven’t. This example of course relies on a fundamental linguistic assumption: that words appearing in similar contexts are related to each other semantically. This is called the distributional hypothesis .

Getting Dense Word Embeddings ¶

How can we solve this problem? That is, how could we actually encode semantic similarity in words? Maybe we think up some semantic attributes. For example, we see that both mathematicians and physicists can run, so maybe we give these words a high score for the “is able to run” semantic attribute. Think of some other attributes, and imagine what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector, like this:

Then we can get a measure of similarity between these words by doing:

Although it is more common to normalize by the lengths:

Where \(\phi\) is the angle between the two vectors. That way, extremely similar words (words whose embeddings point in the same direction) will have similarity 1. Extremely dissimilar words should have similarity -1.

You can think of the sparse one-hot vectors from the beginning of this section as a special case of these new vectors we have defined, where each word basically has similarity 0, and we gave each word some unique semantic attribute. These new vectors are dense , which is to say their entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of different semantic attributes that might be relevant to determining similarity, and how on earth would you set the values of the different attributes? Central to the idea of deep learning is that the neural network learns representations of the features, rather than requiring the programmer to design them herself. So why not just let the word embeddings be parameters in our model, and then be updated during training? This is exactly what we will do. We will have some latent semantic attributes that the network can, in principle, learn. Note that the word embeddings will probably not be interpretable. That is, although with our hand-crafted vectors above we can see that mathematicians and physicists are similar in that they both like coffee, if we allow a neural network to learn the embeddings and see that both mathematicians and physicists have a large value in the second dimension, it is not clear what that means. They are similar in some latent semantic dimension, but this probably has no interpretation to us.

In summary, word embeddings are a representation of the *semantics* of a word, efficiently encoding semantic information that might be relevant to the task at hand . You can embed other things too: part of speech tags, parse trees, anything! The idea of feature embeddings is central to the field.

Word Embeddings in Pytorch ¶

Before we get to a worked example and an exercise, a few quick notes about how to use embeddings in Pytorch and in deep learning programming in general. Similar to how we defined a unique index for each word when making one-hot vectors, we also need to define an index for each word when using embeddings. These will be keys into a lookup table. That is, embeddings are stored as a \(|V| \times D\) matrix, where \(D\) is the dimensionality of the embeddings, such that the word assigned index \(i\) has its embedding stored in the \(i\) ’th row of the matrix. In all of my code, the mapping from words to indices is a dictionary named word_to_ix.

The module that allows you to use embeddings is torch.nn.Embedding, which takes two arguments: the vocabulary size, and the dimensionality of the embeddings.

To index into this table, you must use torch.LongTensor (since the indices are integers, not floats).
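For instance, a minimal sketch of a lookup (the vocabulary and words here are arbitrary):

```python
import torch
import torch.nn as nn

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # vocabulary of 2 words, 5-dimensional embeddings

# indices must be a LongTensor before they can index the embedding table
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)  # a 1 x 5 tensor of learnable parameters
print(hello_embed)
```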

An Example: N-Gram Language Modeling ¶

Recall that in an n-gram language model, given a sequence of words \(w\) , we want to compute

\(P(w_i \mid w_{i-1}, w_{i-2}, \dots, w_{i-n+1})\)

Where \(w_i\) is the ith word of the sequence.

In this example, we will compute the loss function on some training examples and update the parameters with backpropagation.
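As a rough sketch of what such a model and training step might look like (the layer sizes and names below are assumptions, not the tutorial's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NGramLanguageModeler(nn.Module):
    """Embed a fixed-size context and predict log-probabilities over the vocabulary."""

    def __init__(self, vocab_size, embedding_dim, context_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        # inputs: LongTensor of shape (context_size,)
        embeds = self.embeddings(inputs).view(1, -1)  # concatenate the context embeddings
        out = F.relu(self.linear1(embeds))
        return F.log_softmax(self.linear2(out), dim=1)

# one training step: negative log-likelihood of the true next word, then backprop
# model = NGramLanguageModeler(vocab_size, embedding_dim=10, context_size=2)
# loss_fn = nn.NLLLoss()
# optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
# log_probs = model(context_idxs)
# loss = loss_fn(log_probs, torch.tensor([target_idx], dtype=torch.long))
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```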

Exercise: Computing Word Embeddings: Continuous Bag-of-Words ¶

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict words given the context of a few words before and a few words after the target word. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic. Typically, CBOW is used to quickly train word embeddings, and these embeddings are used to initialize the embeddings of some more complicated model. Usually, this is referred to as pretraining embeddings . It almost always helps performance a couple of percent.

The CBOW model is as follows. Given a target word \(w_i\) and an \(N\) context window on each side, \(w_{i-1}, \dots, w_{i-N}\) and \(w_{i+1}, \dots, w_{i+N}\) , referring to all context words collectively as \(C\) , CBOW tries to minimize

\(-\log p(w_i \mid C) = -\log \text{Softmax}\left(A\left(\sum_{w \in C} q_w\right) + b\right)\)

where \(q_w\) is the embedding for word \(w\) .

Implement this model in Pytorch by filling in the class below. Some tips:

  • Think about which parameters you need to define.
  • Make sure you know what shape each operation expects. Use .view() if you need to reshape.
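Following the tips above, a possible skeleton (a sketch, not the official solution) might look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    """Sum the context embeddings, then project to vocabulary log-probabilities."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_idxs):
        # context_idxs: LongTensor of shape (2 * N,)
        embeds = self.embeddings(context_idxs)     # (2N, embedding_dim)
        summed = embeds.sum(dim=0, keepdim=True)   # (1, embedding_dim)
        return F.log_softmax(self.linear(summed), dim=-1)
```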


Assignment 4 - Naive Machine Translation and LSH #

You will now implement your first machine translation system and then you will see how locality sensitive hashing works. Let’s get started by importing the required functions!

If you are running this notebook in your local computer, don’t forget to download the twitter samples and stopwords from nltk.

Important Note on Submission to the AutoGrader #

Before submitting your assignment to the AutoGrader, please make sure that:

You have not added any extra print statement(s) in the assignment.

You have not added any extra code cell(s) in the assignment.

You have not changed any of the function parameters.

You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from it and use the local variables instead.

You are not changing the assignment code where it is not required, like creating extra variables.

If you do any of the above, you will get something like a Grader not found (or similarly unexpected) error upon submitting your assignment. Before asking for help or debugging the errors in your assignment, check for these issues first. If this is the case and you don’t remember the changes you have made, you can get a fresh copy of the assignment by following these instructions .

This assignment covers the following topics: #

1. The word embeddings data for English and French words

1.1 Generate embedding and transform matrices

2. Translations

2.1 Translation as linear transformation of embeddings

2.2 Testing the translation

3. LSH and document search

3.1 Getting the document embeddings

3.2 Looking up the tweets

3.3 Finding the most similar tweets with LSH

3.4 Getting the hash number for a vector

3.5 Creating a hash table

Exercise 10

3.6 Creating all hash tables

Exercise 11

1. The word embeddings data for English and French words #

Write a program that translates English to French.

The full dataset for English embeddings is about 3.64 gigabytes, and the French embeddings are about 629 megabytes. To prevent the Coursera workspace from crashing, we’ve extracted a subset of the embeddings for the words that you’ll use in this assignment.

The subset of data #

To do the assignment on the Coursera workspace, we’ll use the subset of word embeddings.

Look at the data #

en_embeddings_subset: the key is an English word, and the value is a 300 dimensional array, which is the embedding for that word.

fr_embeddings_subset: the key is a French word, and the value is a 300 dimensional array, which is the embedding for that word.

Load two dictionaries mapping the English to French words #

A training dictionary

and a testing dictionary.

Looking at the English French dictionary #

en_fr_train is a dictionary where the key is the English word and the value is the French translation of that English word.

en_fr_test is similar to en_fr_train , but is a test set. We won’t look at it until we get to testing.

1.1 Generate embedding and transform matrices #

Exercise 01: Translating the English dictionary to French by using embeddings #

You will now implement a function get_matrices , which takes the loaded data and returns matrices X and Y .

en_fr : English to French dictionary

en_embeddings : English to embeddings dictionary

fr_embeddings : French to embeddings dictionary

Matrix X and matrix Y , where each row in X is the word embedding for an english word, and the same row in Y is the word embedding for the French version of that English word.

alternate text

Use the en_fr dictionary to ensure that the ith row in the X matrix corresponds to the ith row in the Y matrix.

Instructions : Complete the function get_matrices() :

Iterate over English words in en_fr dictionary.

Check if the word has both an English and a French embedding.

  • Sets are useful data structures that can be used to check if an item is a member of a group.
  • You can get the words that have embeddings by using the dictionary's keys() method.
  • Keep the vectors for `X` and `Y` in lists, in matching order. You can use np.vstack() to merge each list into a numpy matrix.
  • numpy.vstack stacks the items in a list as rows of a matrix.
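A possible sketch of get_matrices() along these lines (the argument names are assumptions):

```python
import numpy as np

def get_matrices(en_fr, french_vecs, english_vecs):
    """Build X, Y whose i-th rows are the embeddings of a translation pair."""
    X_l, Y_l = [], []

    english_set = set(english_vecs.keys())
    french_set = set(french_vecs.keys())

    for en_word, fr_word in en_fr.items():
        # keep the pair only if both words have embeddings
        if en_word in english_set and fr_word in french_set:
            X_l.append(english_vecs[en_word])
            Y_l.append(french_vecs[fr_word])

    # stack the rows into (m, 300) matrices
    X = np.vstack(X_l)
    Y = np.vstack(Y_l)
    return X, Y
```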

Now we will use the get_matrices() function to obtain the matrices X_train and Y_train of English and French word embeddings in their corresponding vector spaces.

2. Translations #

alternate text

Write a program that translates English words to French words using word embeddings and vector space models.

2.1 Translation as linear transformation of embeddings #

Given dictionaries of English and French word embeddings, you will create a transformation matrix R .

Given an English word embedding, \(\mathbf{e}\) , you can multiply \(\mathbf{eR}\) to get a new word embedding \(\mathbf{f}\) .

Both \(\mathbf{e}\) and \(\mathbf{f}\) are row vectors .

You can then compute the nearest neighbors to f in the french embeddings and recommend the word that is most similar to the transformed word embedding.

Describing translation as the minimization problem #

Find a matrix R that minimizes the following:

\(\arg \min_{\mathbf{R}} \lVert \mathbf{X R} - \mathbf{Y} \rVert_{F}\)

Frobenius norm #

The Frobenius norm of a matrix \(A\) (assuming it is of dimension \(m,n\) ) is defined as the square root of the sum of the absolute squares of its elements:

\(\lVert A \rVert_{F} \equiv \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} \lvert a_{ij} \rvert^{2}}\)

Actual loss function #

In real-world applications, the Frobenius norm loss

\(\lVert \mathbf{X R} - \mathbf{Y} \rVert_{F}\)

is often replaced by its squared value divided by \(m\):

\(\frac{1}{m}\lVert \mathbf{X R} - \mathbf{Y} \rVert_{F}^{2}\)

where \(m\) is the number of examples (rows in \(\mathbf{X}\) ).

The same R is found when using this loss function versus the original Frobenius norm.

The reason for taking the square is that it’s easier to compute the gradient of the squared Frobenius.

The reason for dividing by \(m\) is that we’re more interested in the average loss per embedding than the loss for the entire training set.

The loss for all training set increases with more words (training examples), so taking the average helps us to track the average loss regardless of the size of the training set.

[Optional] Detailed explanation why we use norm squared instead of the norm: #

Exercise 02: Implementing the translation mechanism described in this section #

Step 1: Computing the loss #

The loss function will be the squared Frobenius norm of the difference between the matrix and its approximation, divided by the number of training examples \(m\) .

Its formula is: \( L(X, Y, R)=\frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n}\left( a_{i j} \right)^{2}\)

where \(a_{i j}\) is the value in the \(i\) th row and \(j\) th column of the matrix \(\mathbf{XR}-\mathbf{Y}\) .

Instructions: complete the compute_loss() function #

Compute the approximation of Y by matrix multiplying X and R

Compute difference XR - Y

Compute the squared Frobenius norm of the difference and divide it by \(m\) .

  • Useful functions: Numpy dot , Numpy sum , Numpy square , Numpy norm
  • Be careful about which operation is elementwise and which operation is a matrix multiplication.
  • Try to use matrix operations instead of the numpy norm function. If you choose to use the norm function, take care of its extra arguments and make sure it returns the squared loss, not the loss itself.
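A minimal sketch of compute_loss() that follows these steps (assuming the graded function takes X, Y, and R):

```python
import numpy as np

def compute_loss(X, Y, R):
    """Squared Frobenius norm of XR - Y, averaged over the m rows."""
    m = X.shape[0]

    # approximation of Y and its residual
    diff = np.dot(X, R) - Y

    # sum of squared entries of the residual, divided by m
    loss = np.sum(np.square(diff)) / m
    return loss
```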

Expected output:

Exercise 03 #

Step 2: Computing the gradient of the loss with respect to the transform matrix R #

Calculate the gradient of the loss with respect to transform matrix R .

The gradient is a matrix that encodes how much a small change in R affects the loss function.

The gradient tells us how to change R : moving against the gradient decreases the loss.

\(m\) is the number of training examples (number of rows in \(X\) ).

The formula for the gradient of the loss function \(L(X,Y,R)\) is:

\(\frac{dL(X,Y,R)}{dR} = \frac{2}{m}\, X^{T} (\mathbf{X R} - \mathbf{Y})\)

Instructions : Complete the compute_gradient function below.

  • Transposing in numpy
  • Finding out the dimensions of matrices in numpy
  • Remember to use numpy.dot for matrix multiplication
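A minimal sketch of compute_gradient() implementing the formula above (assuming the same X, Y, R arguments):

```python
import numpy as np

def compute_gradient(X, Y, R):
    """Gradient of the mean squared Frobenius loss with respect to R."""
    m = X.shape[0]

    # gradient = (2 / m) * X^T (XR - Y)
    gradient = np.dot(X.T, np.dot(X, R) - Y) * (2 / m)
    return gradient
```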

Step 3: Finding the optimal R with gradient descent algorithm #

Gradient descent #

Gradient descent is an iterative algorithm which is used in searching for the optimum of the function.

Earlier, we’ve mentioned that the gradient of the loss with respect to the matrix encodes how much a tiny change in some coordinate of that matrix affects the loss function.

Gradient descent uses that information to iteratively change matrix R until we reach a point where the loss is minimized.

Training with a fixed number of iterations #

Most of the time we iterate for a fixed number of training steps rather than iterating until the loss falls below a threshold.

OPTIONAL: explanation for fixed number of iterations #

  • You cannot rely on training loss getting low -- what you really want is the validation loss to go down, or validation accuracy to go up. And indeed - in some cases people train until validation accuracy reaches a threshold, or -- commonly known as "early stopping" -- until the validation accuracy starts to go down, which is a sign of over-fitting.
  • Why not always do "early stopping"? Well, mostly because well-regularized models on larger data-sets never stop improving. Especially in NLP, you can often continue training for months and the model will continue getting slightly and slightly better. This is also the reason why it's hard to just stop at a threshold -- unless there's an external customer setting the threshold, why stop, where do you put the threshold?
  • Stopping after a certain number of steps has the advantage that you know how long your training will take - so you can keep some sanity and not train for months. You can then try to get the best performance within this time budget. Another advantage is that you can fix your learning rate schedule -- e.g., lower the learning rate at 10% before finish, and then again more at 1% before finishing. Such learning rate schedules help a lot, but are harder to do if you don't know how long you're training.

Pseudocode:

Calculate gradient \(g\) of the loss with respect to the matrix \(R\) .

Update \(R\) with the formula: \(R_{\text{new}}= R_{\text{old}}-\alpha g\)

Where \(\alpha\) is the learning rate, which is a scalar.

Learning rate #

The learning rate or “step size” \(\alpha\) is a coefficient which decides how much we want to change \(R\) in each step.

If we change \(R\) too much, we could skip the optimum by taking too large of a step.

If we make only small changes to \(R\) , we will need many steps to reach the optimum.

Learning rate \(\alpha\) is used to control those changes.

Values of \(\alpha\) are chosen depending on the problem, and we’ll use learning_rate \(=0.0003\) as the default value for our algorithm.

Exercise 04 #

Instructions: Implement align_embeddings() #

  • Use the 'compute_gradient()' function to get the gradient in each step
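A possible sketch of align_embeddings(), reusing the compute_loss() and compute_gradient() sketches above (the signature and print frequency are assumptions):

```python
import numpy as np

def align_embeddings(X, Y, train_steps=100, learning_rate=0.0003):
    """Gradient descent on R so that XR approximates Y."""
    np.random.seed(129)

    # start from a random square matrix with the embedding dimensionality
    R = np.random.rand(X.shape[1], X.shape[1])

    for i in range(train_steps):
        if i % 25 == 0:
            print(f"loss at iteration {i} is: {compute_loss(X, Y, R):.4f}")

        # take one gradient descent step
        gradient = compute_gradient(X, Y, R)
        R -= learning_rate * gradient

    return R
```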

Expected Output:

Calculate transformation matrix R #

Using the training set, find the transformation matrix \(\mathbf{R}\) by calling the function align_embeddings() .

NOTE: The code cell below will take a few minutes to fully execute (~3 mins)

Expected Output #

2.2 Testing the translation #

k-Nearest neighbors algorithm #

k-NN is a method which takes a vector as input and finds the other vectors in the dataset that are closest to it.

The ‘k’ is the number of “nearest neighbors” to find (e.g. k=2 finds the closest two neighbors).

Searching for the translation embedding #

Since we’re approximating the translation function from English to French embeddings by a linear transformation matrix \(\mathbf{R}\) , most of the time we won’t get the exact embedding of a French word when we transform embedding \(\mathbf{e}\) of some particular English word into the French embedding space.

This is where \(k\) -NN becomes really useful! By using \(1\) -NN with \(\mathbf{eR}\) as input, we can search for an embedding \(\mathbf{f}\) (as a row) in the matrix \(\mathbf{Y}\) which is the closest to the transformed vector \(\mathbf{eR}\)

Cosine similarity #

Cosine similarity between vectors \(u\) and \(v\) is calculated as the cosine of the angle between them. The formula is

\(\cos(u,v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}\)

\(\cos(u,v)\) = \(1\) when \(u\) and \(v\) lie on the same line and have the same direction.

\(\cos(u,v)\) is \(-1\) when they have exactly opposite directions.

\(\cos(u,v)\) is \(0\) when the vectors are orthogonal (perpendicular) to each other.

Note: Distance and similarity are pretty much opposite things. #

We can obtain distance metric from cosine similarity, but the cosine similarity can’t be used directly as the distance metric.

When the cosine similarity increases (towards \(1\) ), the “distance” between the two vectors decreases (towards \(0\) ).

We can define the cosine distance between \(u\) and \(v\) as \(d_{\text{cos}}(u,v)=1-\cos(u,v)\) .

Exercise 05 : Complete the function nearest_neighbor()

A set of possible nearest neighbors candidates

k nearest neighbors to find.

The distance metric should be based on cosine similarity.

The cosine_similarity function is already implemented and imported for you. Its arguments are two vectors and it returns the cosine of the angle between them.

Iterate over rows in candidates , and save the result of similarities between current row and vector v in a python list. Take care that similarities are in the same order as row vectors of candidates .

Now you can use numpy argsort to sort the indices for the rows of candidates .

  • numpy.argsort sorts values from most negative to most positive (smallest to largest)
  • The candidates that are nearest to 'v' should have the highest cosine similarity
  • To reverse the order of the result of numpy.argsort to get the element with highest cosine similarity as the first element of the array you can use tmp[::-1]. This reverses the order of an array. Then, you can extract the first k elements.
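Putting the hints together, a possible sketch of nearest_neighbor() (relying on the provided cosine_similarity function):

```python
import numpy as np

def nearest_neighbor(v, candidates, k=1):
    """Return the indices of the k rows of `candidates` most similar to v."""
    similarity_l = []

    # cosine similarity between v and every candidate row, in row order
    for row in candidates:
        similarity_l.append(cosine_similarity(v, row))

    # argsort gives ascending order; reverse it and keep the top k indices
    sorted_ids = np.argsort(similarity_l)[::-1]
    k_idx = sorted_ids[:k]
    return k_idx
```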

Expected Output :

[[2 0 1]
 [1 0 5]
 [9 9 9]]

Test your translation and compute its accuracy #

Exercise 06 : Complete the function test_vocabulary which takes in English embedding matrix \(X\) , French embedding matrix \(Y\) and the \(R\) matrix and returns the accuracy of translations from \(X\) to \(Y\) by \(R\) .

Iterate over transformed English word embeddings and check if the closest French word vector belongs to French word that is the actual translation.

Obtain an index of the closest French embedding by using nearest_neighbor (with argument k=1 ), and compare it to the index of the English embedding you have just transformed.

Keep track of the number of times you get the correct translation.

Calculate accuracy as \(\text{accuracy}=\frac{\#(\text{correct predictions})}{\#(\text{total predictions})}\)
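A possible sketch of test_vocabulary(), reusing the nearest_neighbor() sketch above:

```python
import numpy as np

def test_vocabulary(X, Y, R):
    """Fraction of rows of XR whose 1-nearest French neighbor is the true translation."""
    # transform every English embedding into the French space
    pred = np.dot(X, R)

    num_correct = 0
    for i in range(len(pred)):
        # index of the closest French embedding to the i-th transformed vector
        pred_idx = nearest_neighbor(pred[i], Y, k=1)[0]
        if pred_idx == i:
            num_correct += 1

    accuracy = num_correct / len(pred)
    return accuracy
```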

Let’s see how your translation mechanism works on unseen data:

You managed to translate words from one language to another without ever seeing them, with almost 56% accuracy, by using some basic linear algebra and learning a mapping of words from one language to another!

3. LSH and document search #

In this part of the assignment, you will implement a more efficient version of k-nearest neighbors using locality sensitive hashing. You will then apply this to document search.

Process the tweets and represent each tweet as a vector (represent a document with a vector embedding).

Use locality sensitive hashing and k nearest neighbors to find tweets that are similar to a given tweet.

3.1 Getting the document embeddings #

Bag-of-words (BOW) document models #

Text documents are sequences of words.

The ordering of words makes a difference. For example, sentences “Apple pie is better than pepperoni pizza.” and “Pepperoni pizza is better than apple pie” have opposite meanings due to the word ordering.

However, for some applications, ignoring the order of words can allow us to train an efficient and still effective model.

This approach is called Bag-of-words document model.

Document embeddings #

Document embedding is created by summing up the embeddings of all words in the document.

If we don’t know the embedding of some word, we can ignore that word.

Exercise 07 : Complete the get_document_embedding() function.

The function get_document_embedding() encodes entire document as a “document” embedding.

It takes in a document (as a string) and a dictionary, en_embeddings

It processes the document, and looks up the corresponding embedding of each word.

It then sums them up and returns the sum of all word vectors of that processed tweet.

  • You can handle missing words easier by using the `get()` method of the python dictionary instead of the bracket notation (i.e. "[ ]"). See more about it here
  • The default value for missing word should be the zero vector. Numpy will broadcast simple 0 scalar into a vector of zeros during the summation.
  • Alternatively, skip the addition if a word is not in the dictionary.
  • You can use your `process_tweet()` function which allows you to process the tweet. The function just takes in a tweet and returns a list of words.
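A possible sketch of get_document_embedding(), relying on the provided process_tweet() helper and the 300-dimensional embeddings:

```python
import numpy as np

def get_document_embedding(tweet, en_embeddings):
    """Sum the embeddings of the processed tweet's words (missing words contribute 0)."""
    doc_embedding = np.zeros(300)

    # process_tweet (provided by the assignment) returns a list of cleaned tokens
    processed_doc = process_tweet(tweet)

    for word in processed_doc:
        # get() returns 0 for unknown words; numpy broadcasts it during the addition
        doc_embedding = doc_embedding + en_embeddings.get(word, 0)

    return doc_embedding
```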

Expected output :

Exercise 08 #

Store all document vectors into a dictionary #

Now, let’s store all the tweet embeddings into a dictionary. Implement get_document_vecs()
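A possible sketch of get_document_vecs() (the return values follow the description above; exact names are assumptions):

```python
import numpy as np

def get_document_vecs(all_docs, en_embeddings):
    """Embed every document and also stack the vectors into one matrix."""
    ind2Doc_dict = {}    # maps document index -> document embedding
    document_vec_l = []

    for i, doc in enumerate(all_docs):
        doc_embedding = get_document_embedding(doc, en_embeddings)
        ind2Doc_dict[i] = doc_embedding
        document_vec_l.append(doc_embedding)

    # stack the per-document rows into an (m, 300) matrix
    document_vec_matrix = np.vstack(document_vec_l)
    return document_vec_matrix, ind2Doc_dict
```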

3.2 Looking up the tweets #

Now you have a matrix of dimension (m, d), where m is the number of tweets (10,000) and d is the dimension of the embeddings (300). You will input a tweet and use cosine similarity to see which tweet in our corpus is most similar to yours.

3.3 Finding the most similar tweets with LSH #

You will now implement locality sensitive hashing (LSH) to identify the most similar tweet.

Instead of looking at all 10,000 vectors, you can just search a subset to find its nearest neighbors.

Let’s say your data points are plotted like this:

alternate text

You can divide the vector space into regions and search within one region for nearest neighbors of a given vector.

alternate text

Choosing the number of planes #

Each plane divides the space to \(2\) parts.

So \(n\) planes divide the space into \(2^{n}\) hash buckets.

We want to organize the 10,000 document vectors into buckets so that every bucket has about 16 vectors.

For that we need \(\frac{10000}{16}=625\) buckets.

We’re interested in \(n\) , the number of planes, such that \(2^{n} \approx 625\) . We can calculate \(n=\log_{2}625 = 9.29\) , which we round up to \(10\) planes.

3.4 Getting the hash number for a vector #

For each vector, we need to get a unique number associated to that vector in order to assign it to a “hash bucket”.

Hyperplanes in vector spaces #

In \(3\) -dimensional vector space, the hyperplane is a regular plane. In \(2\) dimensional vector space, the hyperplane is a line.

Generally, a hyperplane is a subspace whose dimension is one lower than that of the original vector space.

A hyperplane is uniquely defined by its normal vector.

The normal vector \(n\) of the plane \(\pi\) is the vector to which all vectors in the plane \(\pi\) are orthogonal (perpendicular, in the \(3\) -dimensional case).

Using Hyperplanes to split the vector space #

We can use a hyperplane to split the vector space into \(2\) parts.

All vectors whose dot product with a plane’s normal vector is positive are on one side of the plane.

All vectors whose dot product with the plane’s normal vector is negative are on the other side of the plane.

Encoding hash buckets #

For a vector, we can take its dot product with all the planes, then encode this information to assign the vector to a single hash bucket.

When the vector is pointing to the opposite side of the hyperplane than normal, encode it by 0.

Otherwise, if the vector is on the same side as the normal vector, encode it by 1.

If you calculate the dot product with each plane in the same order for every vector, you’ve encoded each vector’s unique hash ID as a binary number, like [0, 1, 1, … 0].

Exercise 09: Implementing hash buckets #

We’ve initialized the hash table hashes for you. It is a list of N_UNIVERSES matrices, each of which describes its own hash table. Each matrix has N_DIMS rows and N_PLANES columns. Every column of that matrix is an N_DIMS -dimensional normal vector for one of the N_PLANES hyperplanes used to create the buckets of that particular hash table.

Exercise : Your task is to complete the function hash_value_of_vector which places vector v in the correct hash bucket.

First multiply your vector v , with a corresponding plane. This will give you a vector of dimension \((1,\text{N_planes})\) .

You will then convert every element in that vector to 0 or 1.

You create a hash vector by doing the following: if the element is negative, it becomes a 0, otherwise you change it to a 1.

You then compute the unique number for the vector by iterating over N_PLANES

Then you multiply \(2^i\) times the corresponding bit (0 or 1).

You will then store that sum in the variable hash_value .

Instructions: Create a hash for the vector in the function below. Use this formula:

\(hash = \sum_{i=0}^{N-1} 2^{i} \times h_{i}\)
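A possible sketch of hash_value_of_vector() following these steps (the argument shapes are assumptions based on the description above):

```python
import numpy as np

def hash_value_of_vector(v, planes):
    """Map v of shape (1, 300) to a bucket id using a (300, n_planes) matrix of plane normals."""
    # sign of the dot product with each plane's normal vector, shape (1, n_planes)
    dot_product = np.dot(v, planes)
    h = (np.sign(dot_product) >= 0).astype(int).flatten()

    # interpret the bit vector as a binary number: sum_i 2^i * h_i
    hash_value = 0
    for i, bit in enumerate(h):
        hash_value += (2 ** i) * bit

    return int(hash_value)
```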

Create the sets of planes #

Create multiple (25) sets of planes (the planes that divide up the region).

You can think of these as 25 separate ways of dividing up the vector space with a different set of planes.

Each element of this list contains a matrix with 300 rows (the word vectors have 300 dimensions) and 10 columns (there are 10 planes in each “universe”).

  • numpy.squeeze() removes unused dimensions from an array; for instance, it converts a (10,1) 2D array into a (10,) 1D array

3.5 Creating a hash table #

Exercise 10 #

Given that you have a unique number for each vector (or tweet), you now want to create a hash table. With a hash table, given a hash_id, you can quickly look up the corresponding vectors. This reduces your search time significantly.

alternate text

We have given you the make_hash_table function, which maps the tweet vectors to a bucket and stores the vector there. It returns the hash_table and the id_table . The id_table lets you know which vector in a certain bucket corresponds to which tweet.

  • a dictionary comprehension, similar to a list comprehension, looks like this: `{i:0 for i in range(10)}`, where the key is 'i' and the value is zero for all key-value pairs.
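For reference, a sketch of how such a make_hash_table() function might work (this is not the provided implementation, just an illustration using the dictionary-comprehension hint and the hash_value_of_vector() sketch above):

```python
import numpy as np

def make_hash_table(vecs, planes):
    """Bucket vectors and their ids by their hash value under `planes` (shape (300, n_planes))."""
    num_planes = planes.shape[1]
    num_buckets = 2 ** num_planes

    # one empty list per possible bucket, built with dictionary comprehensions
    hash_table = {i: [] for i in range(num_buckets)}
    id_table = {i: [] for i in range(num_buckets)}

    for i, v in enumerate(vecs):
        h = hash_value_of_vector(v, planes)
        hash_table[h].append(v)   # the vector itself
        id_table[h].append(i)     # the index of the tweet it came from

    return hash_table, id_table
```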

Expected output #

3.6 Creating all hash tables #

You can now hash your vectors and store them in a hash table that would allow you to quickly look up and search for similar vectors. Run the cell below to create the hashes. By doing so, you end up having several tables which have all the vectors. Given a vector, you then identify the buckets in all the tables. You can then iterate over the buckets and consider much fewer vectors. The more buckets you use, the more accurate your lookup will be, but also the longer it will take.

Approximate K-NN #

Exercise 11 #

Implement approximate K nearest neighbors using locality sensitive hashing, to search for documents that are similar to a given document at the index doc_id .

doc_id is the index into the document list all_tweets .

v is the document vector for the tweet in all_tweets at index doc_id .

planes_l is the list of planes (the global variable created earlier).

k is the number of nearest neighbors to search for.

num_universes_to_use : to save time, we can use fewer than the total number of available universes. By default, it’s set to N_UNIVERSES , which is \(25\) for this assignment.

hash_tables : list with hash tables for each universe.

id_tables : list with id tables for each universe.

The approximate_knn function finds a subset of candidate vectors that are in the same “hash bucket” as the input vector ‘v’. Then it performs the usual k-nearest neighbors search on this subset (instead of searching through all 10,000 tweets).

  • There are many dictionaries used in this function. Try to print out planes_l, hash_tables, id_tables to understand how they are structured, what the keys represent, and what the values contain.
  • To remove an item from a list, use `.remove()`
  • To append to a list, use `.append()`
  • To add to a set, use `.add()`
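A possible sketch of approximate_knn(), reusing the hash_value_of_vector() and nearest_neighbor() sketches above (parameter names follow the description; the details are assumptions):

```python
import numpy as np

def approximate_knn(doc_id, v, planes_l, hash_tables, id_tables, k=1, num_universes_to_use=25):
    """Gather candidates from the matching bucket of each universe, then run ordinary k-NN."""
    vecs_to_consider_l = []
    ids_to_consider_l = []
    ids_to_consider_set = set()

    for universe_id in range(num_universes_to_use):
        planes = planes_l[universe_id]
        hash_value = hash_value_of_vector(v, planes)

        # all vectors (and their ids) that landed in the same bucket as v
        document_vectors_l = hash_tables[universe_id][hash_value]
        new_ids_to_consider = id_tables[universe_id][hash_value]

        for i, new_id in enumerate(new_ids_to_consider):
            # skip the query document itself and anything already collected
            if new_id == doc_id or new_id in ids_to_consider_set:
                continue
            vecs_to_consider_l.append(document_vectors_l[i])
            ids_to_consider_l.append(new_id)
            ids_to_consider_set.add(new_id)

    # ordinary k-NN, but only over the much smaller candidate set
    vecs_to_consider_arr = np.array(vecs_to_consider_l)
    nearest_neighbor_idx_l = nearest_neighbor(v, vecs_to_consider_arr, k=k)
    nearest_neighbor_ids = [ids_to_consider_l[idx] for idx in nearest_neighbor_idx_l]
    return nearest_neighbor_ids
```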

4 Conclusion #

Congratulations - Now you can look up vectors that are similar to the encoding of your tweet using LSH!


4. Word Embeddings #

Our sessions so far have worked off the idea of document annotation to produce metadata about texts. We’ve used this information for everything from information retrieval tasks (Chapter 2) to predictive classification (Chapter 3). Along the way, we’ve also made some passing discussions about how such annotations work to quantify or identify the semantics of those tasks (our work with POS tags, for example). But what we haven’t yet done is produce a model of semantic meaning ourselves. This is another core task of NLP, and there are several different ways to approach building a statistical representation of tokens’ meanings. The present chapter discusses one of the most popular methods of doing so: word embeddings . Below, we’ll overview what word embeddings are, demonstrate how to build and use them, talk about important considerations regarding bias, and apply all this to a document clustering task.

The corpus we’ll use is Melanie Walsh’s collection of ~380 obituaries from the New York Times . If you participated in our Getting Started with Textual Data series, you’ll be familiar with this corpus: we used it in the context of tf-idf scores . Our return to it here is meant to chime with that discussion, for word embeddings enable us to perform a similar kind of text vectorization. Though, as we’ll discuss, the resultant vectors will be considerably more feature-rich than what we could achieve with tf-idf alone.

Learning objectives

By the end of this chapter, you will be able to:

Explain what word embeddings are

Use gensim to train and load word embeddings models

Identify and analyze word relationships in these models

Recognize how bias can inhere in embeddings

Encode documents with a word embeddings model

4.1. How It Works #

Prior to the advent of Transformer models, word embedding served as a state-of-the-art technique for representing semantic relationships between tokens. The technique was first introduced in 2013, and it spawned a host of different variants that completely flooded the field of NLP until about 2018. In part, word embedding’s popularity stems from the relatively simple intuition behind it, which is known as the distributional hypothesis : “you shall know a word by the company it keeps!” (J.R. Firth). Words that appear in similar contexts, in other words, have similar meanings, and what word embeddings do is represent that context-specific information through a set of features. As a result, similar words share similar data representations, and we can leverage that similarity to explore the semantic space of a corpus, to encode documents with feature-rich data, and more.

If you’re familiar with tf-idf vectors, the underlying data structure of word embeddings is the same: every word is represented by a vector of features. But a key difference lies in the sparsity of the vectors – or, in the case of word embeddings, the lack of sparsity. As we saw in the last chapter, tf-idf vectors can suffer from the curse of dimensionality , something that’s compounded by the fact that such vectors must contain features for every word in the corpus, regardless of whether a document has that word. This means tf-idf vectors are highly sparse: they contain many 0s. Word embeddings, on the other hand, do not. They’re what we call dense representations. Each one is a fixed-length, non-sparse vector (of 50-300 dimensions, usually) that is much more information-rich than tf-idf. As a result, embeddings tend to be capable of representing more nuanced relationships between corpus words – a performance improvement that is further boosted by the fact that many of the most popular models had the advantage of being trained on billions and billions of tokens.

The other major difference between these vectors and tf-idf lies in how the former are created. While at root, word embeddings represent token co-occurrence data (just like a document-term matrix), they are the product of millions of guesses made by a neural network. Training this network involves making predictions about a target word, based on that word’s context. We are not going to delve into the math behind these predictions (though this post does); however, it is worth noting that there are two different training set ups for a word embedding model:

Common Bag of Words (CBOW) : given a window of words on either side of a target, the network tries to predict what word the target should be

Skip-grams : the network starts with the word in the middle of a window and picks random words within this window to use as its prediction targets

As you may have noticed, these are just mirrored versions of one another. CBOW starts from context, while skip-gram tries to rebuild context. Regardless, in both cases the network attempts to maximize the likelihood of its predictions, updating its weights accordingly over the course of training. Words that repeatedly appear in similar contexts will help shape these weights, and in turn the model will associate such words with similar vector representations. If you’d like to see all this in action, Xin Rong has produced a fantastic interactive visualization of how word embedding models learn.

Of course, the other way to understand how word embeddings work is to use them yourself. We’ll move on to doing so now.

4.2. Preliminaries #

Here are the libraries we will use in this chapter.

We also initialize an input directory and load a file manifest.

And finally we’ll load the obituaries. While the past two sessions have required full-text representations of documents, word embeddings work best with bags of words, especially when it comes to doing analysis with them. Accordingly, each of the files in the corpus has already been processed by a text cleaning pipeline: they represent the lowercase, stopped, and lemmatized versions of the originals.

With that done, it’s time to move to the model.

4.3. Using an Embeddings Model #

At this point, we are at a crossroads. On the one hand, we could train a word embeddings model using our corpus documents as is. The gensim library offers functionality for this, and it’s a relatively easy operation. On the other, we could use pre-made embeddings, which are usually trained on a more general – and much larger – set of documents. There is a trade-off here:

Training a corpus-specific model will more faithfully represent the token behavior of the texts we’d like to analyze, but these representations could be too specific, especially if the model doesn’t have enough data to train on; the resultant embeddings may be closer to topic models than to word-level semantics

Using pre-made embeddings gives us the benefit of generalization: the vectors will cleave more closely to how we understand language; but such embeddings might a) miss out on certain nuances we’d like to capture, or b) introduce biases into our corpus (more on this below)

In our case, the decision is difficult. When preparing this reader, we (Tyler and Carl) found that a model trained on the obituaries alone did not produce vectors that could fully demonstrate the capabilities of the word embedding technique. The corpus is just a little too specific, and perhaps a little too small. We could’ve used a larger corpus, but doing so would introduce slow-downs in the workshop session. Because of this, we’ve gone with a pre-made model: the Stanford GloVe embeddings (the 200-dimension version). GloVe was trained on billions of tokens, spanning Wikipedia data, newswire articles, even Twitter. More, the model’s developers offer several different dimension sizes, which are helpful for selecting embeddings with the right amount of detail.

That said, going with GloVe introduces its own problems. For one thing, we can’t show you how to train a word embeddings model itself – at least not live. The code to do so, however, is reproduced below:
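Something along these lines (a sketch assuming gensim 4.x parameter names and that `docs` is a list of token lists; the workshop's exact settings may differ):

```python
from gensim.models import Word2Vec

# corpus-specific training on the pre-cleaned obituaries
model = Word2Vec(
    sentences=docs,     # tokenized documents
    vector_size=200,    # dimensionality of the embeddings
    window=5,           # context window on either side of the target word
    min_count=2,        # drop tokens that appear fewer than 2 times
    sg=0,               # 0 = CBOW, 1 = skip-gram
)
word_vectors = model.wv  # the trained KeyedVectors
```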

Another problem has to do with the data GloVe was trained on. It’s so large that we can’t account for all the content, and this becomes particularly detrimental when it comes to bias. Researchers have found that general embeddings models reproduce gender-discriminatory language, even hate speech, by virtue of the fact that they are trained on huge amounts of text data, often without consideration of whether the content of such data is something one would endorse. GloVe is known to be biased in this way. We’ll show an example later on in this chapter and will discuss this in much more detail during our live session, but for now just note that the effects of bias do shape how we represent our corpus, and it’s important to keep an eye out for this when working with the data.

4.3.1. Loading a model #

With all that said, we can move on. Below, we load GloVe embeddings into our workspace using a gensim wrapper.
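One way to do this (a sketch; the workshop may instead load a local file with KeyedVectors) is through gensim's downloader:

```python
import gensim.downloader as api

# fetch the 200-dimension GloVe vectors; returns a KeyedVectors object
glove = api.load("glove-wiki-gigaword-200")
```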

The KeyedVectors object acts much like a Python dictionary, and you can do certain Python operations directly on it.

4.3.2. Token mappings #

Each token in the model has an associated index. This mapping is accessible via .key_to_index .

If you want the vector representation for a token, use either the key or the index.
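For example (a minimal sketch; the token choice is arbitrary):

```python
idx = glove.key_to_index["coffee"]   # token -> integer index
vec_by_key = glove["coffee"]         # 200-dimensional vector, looked up by key
vec_by_idx = glove.vectors[idx]      # the same vector, looked up by row index
```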

Here’s its vector:

Here are some random tokens in the model:

You may find some unexpected tokens in this output. Though it has been ostensibly trained on an English corpus, GloVe contains multilingual text. It also contains lots of noisy tokens, which range from erroneous segmentations (“drummer/percussionist” is one token, for example) to password-like strings and even HTML markup. Depending on your task, you may not notice these tokens, but they do in fact influence the overall shape of the model, and sometimes you’ll find them cropping up when you’re hunting around for similar terms and the like (more on this soon).

4.3.3. Out-of-vocabulary tokens #

While GloVe’s vocabulary sometimes seems too expansive, there are other instances where it’s too restricted.

If the model wasn’t trained on a particular word, it won’t have a corresponding vector for that word either. This is crucial. Because models like GloVe only know what they’ve been trained on, you need to be aware of any potential discrepancies between their vocabularies and your corpus data. If you don’t keep this in mind, sending unseen, or out-of-vocabulary , tokens to GloVe will throw errors in your code.

There are a few ways to handle this problem. The most common is to simply not encode tokens in your corpus that don’t have a corresponding vector in GloVe. Below, we construct two sets for our corpus data. The first contains all tokens in the corpus, while the second tracks which of those tokens are in the model. We identify whether the model has a token using its .has_index_for() method.
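A sketch of that bookkeeping (assuming `docs` is the list of token lists loaded above):

```python
# all corpus tokens, and the subset GloVe actually has vectors for
corpus_vocab = {token for doc in docs for token in doc}
in_glove = {token for token in corpus_vocab if glove.has_index_for(token)}
```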

Any subsequent code we write will need to reference these sets to determine whether it should encode a token.

While this is what we’ll indeed do below, obviously it isn’t an ideal situation. But it’s one of the consequences of using premade models. There are, however, a few other ways to handle out-of-vocabulary terms. Some models offer special “UNK” tokens, which you could associate with all of your problem tokens. This, at the very least, enables you to have some representation of your data. A more complex approach involves taking the mean embedding of the word vectors surrounding an unknown token; and depending on the model, you can also train it further, adding extra tokens from your domain-specific text. Instructions for this last option are available here in the gensim documentation.

4.4. Word Relationships #

Later on we’ll use GloVe to encode our corpus texts. But before we do, it’s worth demonstrating more generally some of the properties of word vectors. Vector representations of text allow us to perform various mathematical operations on our corpus that approximate semantics. The most common among these operations is finding the cosine similarity between two vectors. Our Getting Started with Textual Data series has a whole chapter on this measure, so if you haven’t encountered it before, we recommend you read that. But in short: cosine similarity measures the difference between vectors’ orientation in a feature space (here, the feature space is comprised of each of the vectors’ 200 dimensions). The closer two vectors are, the more likely they are to share semantic similarities.

4.4.1. Similar tokens #

gensim provides easy access to this measure and other such vector space operations. To find the cosine similarity between the vectors for two words in GloVe, simply use the model’s .similarity() method:
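For example (the word pair is arbitrary):

```python
print(glove.similarity("cat", "dog"))  # a single cosine similarity score
```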

The only difference between the score above and the one that you might produce, say, with scikit-learn ’s cosine similarity implementation is that gensim bounds its values from [-1,1] , whereas the latter uses a [0,1] scale. While in gensim it’s still the case that similar words score closer to 1 , highly dissimilar words will be closer to -1 .

At any rate, we can get the top n most similar words for a word using .most_similar() . The function defaults to 10 entries, but you can change that with the topn parameter. We’ll wrap this in a custom function, since we’ll call it a number of times.
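A sketch of such a wrapper (the function name and DataFrame formatting are assumptions):

```python
import pandas as pd

def similarities(token, n=10):
    """Return the n most similar tokens and their scores as a small table."""
    sims = glove.most_similar(token, topn=n)
    return pd.DataFrame(sims, columns=["token", "score"])

similarities("coffee", n=5)
```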

Now we sample some tokens and find their most similar tokens.

It’s also possible to find the least similar word. This is useful to show, because it pressures our idea of what counts as similarity. Mathematical similarity does not always align with concepts like synonyms and antonyms. For example, it’s probably safe to say that the semantic opposite of “good” – that is, its antonym – is “evil.” But in the world of vector spaces, the least similar word to “good” is:

Just noise! Relatively speaking, the vectors for “good” and “evil” are actually quite similar.

How do we make sense of this? Well, it has to do with the way the word embeddings are created. Since embeddings models are ultimately trained on co-occurrence data, words that tend to appear in similar kinds of contexts will be more similar in a mathematical sense than those that don’t.

Keeping this in mind is also important for considerations of bias. Since, in one sense, embeddings reflect the interchangeability between tokens , they will reinforce negative, even harmful patterns in the data (which is to say in culture at large). For example, consider the most similar words for “doctor” and “nurse.” The latter is locked up within gendered language: according to GloVe, a nurse is like a midwife is like a mother.

4.4.2. Concept modeling #

Beyond cosine similarity, there are other word relationships to explore via vector space math. For example, one way of modeling something like a concept is to think about what other concepts comprise it. In other words: what plus what creates a new concept? Could we identify concepts by adding together vectors to create a new vector? Which words would this new vector be closest to in the vector space? Using the .similar_by_vector() method, we can find out.
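For example (a sketch; the component words are arbitrary):

```python
# add two word vectors to approximate a concept, then find its neighbors
synthetic = glove["river"] + glove["boat"]
glove.similar_by_vector(synthetic, topn=10)
```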

Not bad! While our target concepts aren’t the most similar words for these synthetic vectors, they’re often in the top-10 most similar results.

4.4.3. Analogies #

Most famously, word embeddings enable quasi-logical reasoning. Though relationships between antonyms and synonyms do not necessarily map to a vector space, certain analogies do – at least under the right circumstances, and with particular training data. The logic here is that we identify a relationship between two words and we subtract one of those words’ vectors from the other. To that new vector we add in a vector for a target word, which forms the analogy. Querying for the word closest to this modified vector should produce a similar relation between the result and the target word as that between the original pair.

Here, we ask: “strong is to stronger what clear is to X?” Ideally, we’d get “clearer.”
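In gensim terms, this might look like the following (a sketch of the standard analogy query):

```python
# X is approximately: stronger - strong + clear
glove.most_similar(positive=["stronger", "clear"], negative=["strong"], topn=5)
```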

“Paris is to France what Berlin is to X?” Answer: “Germany.”

Both of the above produce compelling results, though your mileage may vary. Consider the following: “arm is to hand what leg is to X?” We’d expect “foot.”

Importantly, these results are always going to be specific to the data on which a model was trained. Claims made on the basis of word embeddings that aspire to general linguistic truths would be treading on shaky ground here.

4.5. Document Similarity #

While the above word relationships are relatively abstract (and any such findings therefrom should be couched accordingly), we can ground them with a concrete task. In this final section, we use GloVe embeddings to encode our corpus documents. This involves associating a word vector for each token in an obituary. Of course, GloVe has not been trained on the obituaries, so there may be important differences in token behavior between that model and the corpus; but we can assume that the general nature of GloVe will give us a decent sense of the overall feature space of the corpus. The result will be an enriched representation of each document, the nuances of which may better help us identify things like similarities between obituaries in our corpus.

The other consideration for using GloVe with our specific corpus concerns the out-of-vocabulary words we’ve already discussed. Before we can encode our documents, we need to filter out tokens for which GloVe has no representation. We can do so by referencing the in_glove set we produced above.

4.5.1. Encoding #

Time to encode. This is an easy operation. All we need to do is run the list of document’s tokens directly into the model object and gensim will encode each accordingly. The result will be an (n, 200) array, where n is the number of tokens we passed to the model; each one will have 200 dimensions.

But if we kept this array as is, we’d run into trouble. Matrix operations often require identically shaped representations, so documents with different lengths would be incomparable. To get around this, we take the mean of all the vectors in a document. The result is a 200-dimension vector that stands as a general representation of a document.
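A sketch of encoding a single document this way (`doc` is assumed to be one obituary's token list):

```python
doc_tokens = [tok for tok in doc if tok in in_glove]  # drop out-of-vocabulary tokens
token_vectors = glove[doc_tokens]                     # (n_tokens, 200) array
doc_embedding = token_vectors.mean(axis=0)            # (200,) document vector
```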

Let’s quickly check our work.

4.5.2. Visualizing #

From here, we can use these embeddings for any task that requires feature vectors. For example, let’s plot our documents using t-SNE. First, we reduce the embeddings.
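A sketch of that reduction (assuming `doc_embeddings` is the (n_docs, 200) array of mean document vectors; the random seed is arbitrary):

```python
from sklearn.manifold import TSNE

# project the 200-dimensional document vectors down to 2 dimensions for plotting
reduced = TSNE(n_components=2, random_state=357).fit_transform(doc_embeddings)
```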

Now we define a function to make our plot. We’ll add some people to look for along as well (in this case, a few baseball players)


4.5.3. Clustering #

The document embeddings seem to be partitioned into different clusters. We’ll end by using a hierarchical clusterer to see if we can further specify these clusters. This involves using the AgglomerativeClustering object, which we fit to our embeddings. Hierarchical clustering requires a pre-defined number of clusters. In this case, we use 18.
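A sketch of that fit (scikit-learn's AgglomerativeClustering, with the cluster count fixed at 18):

```python
from sklearn.cluster import AgglomerativeClustering

# hierarchical clustering on the document embeddings
agglom = AgglomerativeClustering(n_clusters=18)
cluster_labels = agglom.fit_predict(doc_embeddings)
```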

Now we assign the clusterer’s predicted labels to the visualization data DataFrame and re-plot the results.


These clusters seem to be both detailed and nicely partitioned, bracketing off, for example, classical musicians and composers (cluster 6) from jazz and popular musicians (cluster 10).

Consider further cluster 5, which seems to be about famous scientists.

There are, however, some interestingly noisy clusters, like cluster 12. With people like Queen Victoria and William McKinley in this cluster, it at first appears to be about national leaders of various sorts, but the inclusion of others like Al Capone (the gangster) and Ernie Pyle (a journalist) complicate this. If you take a closer look, what really seems to be tying these obituaries together is war. Nearly everyone here was involved in war in some fashion or another – save for Capone, whose inclusion makes for strange bedfellows.

Depending on your task, these detailed distinctions may not be so desirable. But for us, the document embeddings provide a wonderfully nuanced view of the kinds of people in the obituaries. From here, further exploration might involve focusing on misfits and outliers. Why, for example, is Capone in cluster 12? Or why is Lou Gehrig all by himself in his own cluster? Of course, we could always re-cluster this data, which would redraw such groupings, but perhaps there is something indeed significant about the way things are divided up as they stand. Word embeddings help bring us to a point where we can begin to undertake such investigations – what comes next depends on which questions we want to ask.


Word embeddings: exploration, explanation, and exploitation (with code in Python)

Halyna Oliinyk


Towards Data Science

Word embeddings have been a topic of discussion for natural language processing scientists for many, many years, so don’t expect me to tell you something dramatically new or ‘open your eyes’ to the world of word vectors. I’m here to cover some basics of word embeddings and describe the most common word embedding techniques, with formulas explained and code snippets attached.

So, as every popular data science book or blog post should say after the introduction, let’s dive in!

Informal definition

As Wikipedia points out, word embedding is

‘the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers’.

Strictly speaking, this definition is absolutely correct but offers few insights if the person reading it has never worked with natural language processing or machine learning techniques. More informally, I would say that a word embedding is

the vector, which reflects the structure of the word in terms of morphology ( Enriching Word Vectors with Subword Information ) / word-context(s) representation ( word2vec Parameter Learning Explained ) / global corpus statistics ( GloVe: Global Vectors for Word Representation ) / words hierarchy in terms of WordNet terminology ( Poincaré Embeddings for Learning Hierarchical Representations ) / relationship between a set of documents and the terms they contain ( Latent semantic indexing ) / etc.

The idea behind all word embeddings is to capture as much semantic/morphological/contextual/hierarchical/etc. information as possible, but in practice some methods are definitely better than others for a particular task (for instance, LSA is quite effective when working in a low-dimensional space for analyzing incoming documents from the same domain as the ones already processed and put into the term-document matrix). Choosing the best embeddings for a particular project is always a matter of trial and error, so understanding why one model works better than another in a given case helps a lot in real work.

In fact, a reliable representation of words with real-valued vectors is the goal we’re trying to reach. Sounds easy, doesn’t it?

One-hot encoding (CountVectorizing)

The most basic and naive method for transforming words into vectors is to count the occurrence of each word in each document. Such an approach is called count vectorizing or one-hot encoding (depending on the literature).

The idea is to collect a set of documents (they can be words, sentences, paragraphs or even articles) and count the occurrence of every word in them. Strictly speaking, the columns of the resulting matrix are words and the rows are documents.

The code snippet attached is the basic sklearn implementation, full documentation can be found here .

For instance, the word ‘first’ in the given example corresponds to the vector [1,0,0,0] , which is the 2nd column of the matrix X . Sometimes the output of this method is called a ‘sparse matrix’, since most of the elements of X are zeros and sparsity is its defining feature.
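The original snippet is not reproduced here; as a rough stand-in, basic CountVectorizer usage looks something like this (the toy corpus is an assumption, so the exact columns will differ from the example described above):

```python
from sklearn.feature_extraction.text import CountVectorizer

# an illustrative corpus (an assumption, not the article's original example)
corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse document-term matrix
print(vectorizer.get_feature_names_out())  # the learned vocabulary (columns)
print(X.toarray())                         # documents as rows of counts
```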

TF-IDF transforming

The idea behind this approach is term weighting using a useful statistical measure called tf-idf . In a large corpus of documents, words like ‘a’, ‘the’, ‘is’, etc. occur very frequently, but they don’t carry a lot of information. With the count-based approach, the vectors of these words are not very sparse, which suggests, misleadingly, that these words are important and carry a lot of information simply because they appear in so many documents. One way to solve this problem is stopword filtering, but this solution is crude and not flexible to the domain we’re working with.

A more natural solution to the stopwords issue is to use a statistical quantity that looks like:

\(\text{tf-idf}(t, d) = \text{tf}(t, d)\times \text{idf}(t)\)

The first part of it is tf , which means ‘term frequency’. By this we simply mean the number of times the word occurs in the document divided by the total number of words in the document:

\(\text{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}\)

The second part is idf , which stands for ‘inverse document frequency’, interpreted as the inverse of the number of documents in which the term we’re interested in occurs. We also take the logarithm of this component:

\(\text{idf}(t) = \log\frac{N}{\lvert \{d \in D : t \in d\} \rvert}\)

We’re done with the formula, but how do we use it? In the previously reviewed method, a word corresponds to a column j and a document to a row i , with the cell holding the number of occurrences. We take the same CountVectorizer matrix calculated earlier and replace each cell with the tf-idf score for that term and that document.

Full documentation on this sklearn class can be found here ; there are many interesting parameters that can be used.
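A matching sketch with TfidfVectorizer (reusing the illustrative corpus above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# same document-term layout as before, but cells hold tf-idf weights
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```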

Word2Vec ( word2vec parameter learning explained )

As I would say, here the fun begins! Word2Vec is the first neural embedding model (or at least the first that gained popularity, in 2013) and is still the one used by most researchers. Doc2Vec, its child inspired by Word2Vec, is also the most popular model for paragraph representation. In fact, many of the concepts we will review later build on Word2Vec prerequisites, so be sure to pay enough attention to this embedding type.

There are 3 different types of Word2Vec parameter learning, and all of them are based on a neural network model, so this section assumes that you know what that is.

One-word context The intuition behind it is that we consider one word per context (we predict one word given only one word); this approach is often referred to as the CBOW model. The architecture of our neural network is: a one-hot encoded input vector of size V×1 , an input → hidden layer weights matrix W of size V×N , a hidden layer → output layer weights matrix W’ of size N×V , and a softmax function as the final activation step. Our goal is to calculate the following probability distribution for the input word with index I:

\(p(w_j \mid w_I)\)

We assume that our input vector x has all zeros and a single 1 at position k . The hidden layer h is computed with:

\(h = W^{T}x = v_{w_I}^{T}\)

Speaking about this notation, we can consider h to be the ‘input vector’ of the word x . Every word in our vocabulary has input and output representations; formally, row i of the weights matrix W is the ‘input’ vector representation of word i , so we use the colon sign to avoid misunderstandings.

As the next step of the neural network, we take vector h and do the following computation for every output word j :

\(u_j = v'^{T}_{w_j} h\)

Our v’ is the output vector of the word w with index j , and for every entry u with index j we do this multiplication operation.

As we’ve said before, the activation step is calculated with the standard softmax (negative sampling or hierarchical softmax techniques are also welcome):

\(p(w_j \mid w_I) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V}\exp(u_{j'})}\)

The diagram on the method captures all of the steps described.

Multi-word context This model has no differences from the one-word context, except the type of probability distribution we want to obtain and the type of hidden layer we’re having. Interpretation of multi-word context is the fact that we’d like to predict multinomial distribution given not only one context word but rather many of them to store information about the relation of our target word to other words from the corpus.

Our probability distribution now looks this way:

To obtain it, we change our hidden layer function to:

\(h = \frac{1}{C} W^{T}(x_1 + x_2 + \dots + x_C)\)

This is simply the average of our context vectors x from 1 to C . The cost function now takes the form of:

All of the other components are the same for this architecture.

Skip-gram model Imagine the situation opposite to the CBOW multi-word model: we’d like to predict c context words given one target word as the input. Then the objective we’re trying to maximize changes dramatically:

\(\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \leq j \leq c,\ j \neq 0} \log p(w_{t+j} \mid w_t)\)

Here -c and c are the limits of our context window, and the word with index t ranges over every word of the corpus we’re working with.

Our first step we’re doing to obtain hidden layer is the same as for two previous cases:

Our output layer (without activation) is calculated with:

On the output layer, we compute c multinomial distributions; each output panel shares the same weights from the hidden layer → output layer weights matrix W’ . As the output activation we again use softmax, with slightly changed notation to account for c panels rather than the single output panel we had earlier:

Illustration on the skip-gram calculation replicates all of the stages performed.

Basic implementation of Word2Vec model can be performed with gensim; full documentation is here .

GloVe ( Glove: Global Vectors for Word Representation )

The approach of global word representation is used to capture the meaning of one word embedding with the structure of the whole observed corpus; word frequency and co-occurrence counts are the main measures on which the majority of unsupervised algorithms are based. The GloVe model trains on global co-occurrence counts of words and makes good use of these statistics by minimizing a least-squares error, producing, as a result, a word vector space with meaningful substructure. Such an approach preserves word similarities as vector distances.

To store this information we use a co-occurrence matrix X , each entry \(X_{ij}\) of which corresponds to the number of times word j occurs in the context of word i . As a consequence,

\(P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}, \quad X_i = \sum_{k} X_{ik}\)

is the probability that the word with index j occurs in the context of word i .

Ratios of co-occurrence probabilities are the appropriate starting point to begin word embedding learning. We firstly define a function F as:

which is dependent on 2 word vectors with indexes i and j and separate context vector with index k . F encodes the information, present in the ratio; the most intuitive way to represent this difference in vector form is to subtract one vector from another:

Now in the equation, the left-hand side is the vector, while the right-hand side is the scalar. To avoid this we can calculate the product of 2 terms (product operation still allows us to capture the information we need):

Since in the word-word co-occurrence matrix the distinction between context words and standard words is arbitrary, we can replace the probabilities ratio with:

and solve the equation:

If we assume that F is exp(), then the solution becomes:

This equation does not preserve symmetry, so we absorb two of the terms into biases:

Now the loss function we minimize is a weighted least-squares (linear regression) objective with some modifications:

where f is the weighting function, which is defined manually.
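For reference, the corresponding chain of equations as given in the GloVe paper (subtract the word vectors, take the dot product, restore symmetry with exp and biases, then minimize a weighted least-squares loss):

\begin{align}
F(w_i - w_j, \tilde w_k) &= \frac{P_{ik}}{P_{jk}} \\
F\!\left((w_i - w_j)^{\top} \tilde w_k\right) &= \frac{F(w_i^{\top} \tilde w_k)}{F(w_j^{\top} \tilde w_k)} \\
w_i^{\top} \tilde w_k &= \log P_{ik} = \log X_{ik} - \log X_i \\
w_i^{\top} \tilde w_k + b_i + \tilde b_k &= \log X_{ik} \\
J &= \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde w_j + b_i + \tilde b_j - \log X_{ij} \right)^2, \qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
\end{align}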

GloVe embeddings can also be used from Python; gensim, for example, can load pretrained GloVe vectors, as in the snippet below.
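A minimal loading sketch via gensim's downloader API; the model name "glove-wiki-gigaword-100" is one of the standard pretrained sets shipped with gensim-data and is used here purely as an example.

```python
# Load pretrained GloVe vectors through gensim's downloader API.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # returns a KeyedVectors object

print(glove["king"][:5])  # first components of the 100-d vector for "king"
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```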

FastText ( Enriching Word Vectors with Subword Information )

What if we want to take the morphology of words into account? To do so, we can still use the skip-gram model (recall that the skip-gram baseline is used in many other related works) together with the negative sampling objective.

Let us consider that we are given a scoring function s which maps (word, context) pairs to real-valued scores. The problem of predicting context words can then be viewed as a sequence of binary classification tasks (predicting the presence or absence of context words). For the word at position t we consider all context words as positive examples and sample negative examples at random from the dictionary.

Now our negative sampling objective is:

The FastText model takes the internal structure of words into account by splitting them into a bag of character n-grams and adding the whole word as a final feature. If we denote the vector of an n-gram g as z_g and the output vector representation of the context word c as v_c, the scoring function becomes:

We can choose n-grams of any size, but in practice size from 3 to 6 is the most suitable one.
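For reference, the negative sampling objective and the subword scoring function as given in the FastText paper, where C_t is the set of context positions for position t, N_{t,c} the sampled negatives, and G_w the set of n-grams of word w:

\begin{align}
\sum_{t=1}^{T} \Bigg[ \sum_{c \in \mathcal{C}_t} \ell\big(s(w_t, w_c)\big) + \sum_{n \in \mathcal{N}_{t,c}} \ell\big(-s(w_t, n)\big) \Bigg],
\qquad \ell(x) = \log\!\left(1 + e^{-x}\right)
\end{align}

\begin{align}
s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c
\end{align}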

All of the documentation and examples on FastText implementation with the Facebook library are available here .
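The linked Facebook library is the reference implementation; as an alternative, here is a minimal sketch with gensim's FastText reimplementation (the toy corpus and parameter values are placeholders):

```python
# FastText training sketch with gensim's reimplementation.
from gensim.models import FastText

sentences = [
    ["where", "are", "you", "going"],
    ["going", "where", "nobody", "has", "gone"],
]

model = FastText(
    sentences=sentences,
    vector_size=100,
    window=3,
    min_count=1,
    min_n=3,   # smallest character n-gram
    max_n=6,   # largest character n-gram
    epochs=30,
)

# Out-of-vocabulary words still get a vector, assembled from their n-grams.
print(model.wv["goings"][:5])
```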

Poincaré embeddings ( Poincaré Embeddings for Learning Hierarchical Representations )

Poincaré embeddings are a recent trend in the natural language processing community, based on using hyperbolic geometry to capture hierarchical properties of words that we cannot capture directly in Euclidean space. We need this kind of geometry, together with the Poincaré ball, because the number of nodes grows exponentially with the distance from the root of a tree to its leaves, and hyperbolic geometry is able to represent this property.

Notes on hyperbolic geometry

Hyperbolic geometry studies non-Euclidean spaces of constant negative curvature. Two of its main properties are:

  • for every line l and every point P not on l, there pass through P at least two distinct lines parallel to l; moreover, there are infinitely many parallels to l through P;
  • all triangles have an angle sum of less than 180 degrees.

For 2-dimensional hyperbolic space we see that area s and length l both grow exponentially with formulas:
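For a disc of radius r in a space of constant curvature −1 (the setting used in the Poincaré embeddings paper), these are:

\begin{align}
l(r) = 2\pi \sinh(r), \qquad s(r) = 2\pi \big(\cosh(r) - 1\big)
\end{align}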

Hyperbolic geometry is set up in such a way that the distance in the embedding space we create reflects the semantic similarity of the words.

The model from hyperbolic geometry we’re interested in is the Poincaré ball model, stated as

\mathcal{B}^d = \{\, x \in \mathbb{R}^d : \lVert x \rVert < 1 \,\}

where the norm is the standard Euclidean norm. The Poincaré ball model can be drawn as a disk in which the straight lines consist of all segments of circles contained within the disk that are orthogonal to its boundary, plus all diameters of the disk.

Poincaré embeddings baseline

Consider the task of modeling a tree in a metric space so that its structure is reflected in the embedding; as we know, the number of children in the tree grows exponentially with the distance from the root.

A regular tree with branching factor b can be modeled in two-dimensional hyperbolic geometry such that nodes that are exactly l levels below the root are placed on a sphere of radius proportional to l, and nodes that are less than l levels below the root are located within this sphere. Hyperbolic spaces can thus be thought of as continuous versions of trees, and trees as discrete hyperbolic spaces.

The distance measure between two embeddings u and v that we define in hyperbolic space is

d(u, v) = \operatorname{arcosh}\!\left( 1 + 2\,\frac{\lVert u - v \rVert^2}{(1 - \lVert u \rVert^2)(1 - \lVert v \rVert^2)} \right)

which gives us the ability not only to capture the similarity between embeddings effectively (through their distance) but also to preserve their hierarchy (through their norm), which we take from the WordNet taxonomy.

We minimize the loss function with respect to Θ (the embeddings, each an element of the Poincaré ball, so each norm must be smaller than one):

where D is the set of all observed hyponymy relations and N(u) is the set of negative examples for u .
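Following the Poincaré embeddings paper, the loss described here can be written as (sign and normalization conventions may differ slightly):

\begin{align}
\mathcal{L}(\Theta) = -\sum_{(u,v) \in D} \log \frac{e^{-d(u,v)}}{\sum_{v' \in N(u)} e^{-d(u,v')}}
\end{align}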

Training details

  • initialize all embeddings randomly from the uniform distribution;
  • learn embeddings for all symbols in D such that related objects are close in the embedding space.
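gensim also ships an implementation of this model; below is a minimal training sketch on a toy set of hyponymy pairs (the relations and parameter values are placeholders):

```python
# Poincare embeddings training sketch with gensim.
from gensim.models.poincare import PoincareModel

# Each tuple is an observed (child, parent)-style relation from D.
relations = [
    ("kangaroo", "marsupial"),
    ("marsupial", "mammal"),
    ("cat", "mammal"),
    ("mammal", "animal"),
]

model = PoincareModel(relations, size=2, negative=2)  # 2-d ball, 2 negatives per example
model.train(epochs=50)

# Hyperbolic distance between two embedded nodes.
print(model.kv.distance("kangaroo", "mammal"))
```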

Conclusions

I consider this article to be a short intro to word embeddings, briefly describing the most common natural language processing techniques, their peculiarities and theoretical foundations. For more details on every method, a link to each of the original papers is attached; most of the formulas are included, and further explanation of the notation can be found in the papers themselves.

I haven’t mentioned some of the basic word-embedding matrix factorization methodologies, like latent semantic indexing, and haven’t paid much attention to real-life applications of each approach, since that depends on the task and the given corpus; for instance, the creators of GloVe claim that their approach worked well on the named entity recognition task with the CoNLL dataset, but that doesn’t mean it will work best on unstructured data coming from different domains.

Also, paragraph embeddings are not covered in this article, but that is another story… Is it worth telling?

Written by Halyna Oliinyk

Word Embeddings

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.

Techniques for learning word embeddings can include Word2Vec, GloVe, and other neural network-based approaches that train on an NLP task such as language modeling or document classification.


nlp-recipes: Natural Language Processing Best Practices & Examples

Word Embedding

This folder contains examples and best practices, written in Jupyter notebooks, for training word embeddings on custom data from scratch. There are three typical methods for training word embeddings: Word2Vec, GloVe, and fastText. All three methods provide pretrained models (a pretrained model with Word2Vec, a pretrained model with GloVe, a pretrained model with fastText). These pretrained models are trained on general corpora such as Wikipedia data and Common Crawl data, and may not serve well in situations where you have a domain-specific language problem or where there is no pretrained model for the language you need to work with. In this folder, we provide examples of how to apply each of the three methods to train your own word embeddings.

What is Word Embedding?

Word embedding is a technique to map words or phrases from a vocabulary to vectors of real numbers. The learned vector representations of words capture syntactic and semantic word relationships and therefore can be very useful for tasks like sentence similarity, text classification, etc.

| Notebook | Environment | Description | Dataset | Language |
|----------|-------------|-------------|---------|----------|
|          | Local       | A notebook that shows how to learn word representations with Word2Vec, fastText and GloVe |  | en |

Word Embeddings in NLP

Word Embeddings are numeric representations of words in a lower-dimensional space, capturing semantic and syntactic information. They play a vital role in Natural Language Processing (NLP) tasks. This article explores traditional and neural approaches such as TF-IDF, Word2Vec, and GloVe, offers insights into their advantages and disadvantages, and explains the importance of pre-trained word embeddings and their applications in various NLP scenarios.

What is Word Embedding in NLP?

Word Embedding is an approach for representing words and documents. Word Embedding or Word Vector is a numeric vector input that represents a word in a lower-dimensional space. It allows words with similar meanings to have a similar representation.

Word Embeddings are a method of extracting features out of text so that we can input those features into a machine learning model that works with text data. They try to preserve syntactic and semantic information. Methods such as Bag of Words (BoW), CountVectorizer and TF-IDF rely on the word counts in a sentence but do not save any syntactic or semantic information. In these algorithms, the size of the vector is the number of elements in the vocabulary, so the resulting vectors are sparse, with most elements being zero. Large input vectors also mean a huge number of weights, which results in high computational cost for training. Word Embeddings give a solution to these problems.

Need for Word Embedding?

  • To reduce dimensionality
  • To use a word to predict the words around it.
  • Inter-word semantics must be captured.

How are Word Embeddings used?

  • They are used as input to machine learning models. Take the words → get their numeric representation → use them in training or inference.
  • To represent or visualize any underlying patterns of usage in the corpus that was used to train them.

Let’s take an example to understand how word vectors are generated: take the emojis that are most frequently used in certain conditions, and transform each emoji into a vector, with those conditions as our features.


In a similar way, we can create word vectors for different words as well on the basis of given features. The words with similar vectors are most likely to have the same meaning or are used to convey the same sentiment.

Approaches for Text Representation

1. Traditional Approach

The conventional method involves compiling a list of distinct terms and giving each one a unique integer value, or id, and then inserting each word’s distinct id into the sentence. Every vocabulary word is handled as a feature in this case, so a large vocabulary results in an extremely large feature size. Common traditional methods include:

1.1. One-Hot Encoding

One-hot encoding is a simple method for representing words in natural language processing (NLP). In this encoding scheme, each word in the vocabulary is represented as a unique vector, where the dimensionality of the vector is equal to the size of the vocabulary. The vector has all elements set to 0, except for the element corresponding to the index of the word in the vocabulary, which is set to 1.
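A minimal sketch of this scheme (the toy vocabulary is a placeholder):

```python
# One-hot encoding sketch: each word maps to a vector with a single 1.
vocab = ["i", "am", "happy", "because", "learning"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

print(one_hot("happy"))  # [0, 0, 1, 0, 0]
```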

While one-hot encoding is a simple and intuitive method for representing words in NLP, it has several disadvantages, which may limit its effectiveness in certain applications.

  • One-hot encoding results in high-dimensional vectors, making it computationally expensive and memory-intensive, especially with large vocabularies.
  • It does not capture semantic relationships between words; each word is treated as an isolated entity without considering its meaning or context.
  • It is restricted to the vocabulary seen during training, making it unsuitable for handling out-of-vocabulary words.

1.2. Bag of Words (BoW)

Bag-of-Words (BoW) is a text representation technique that represents a document as an unordered set of words and their respective frequencies. It discards the word order and captures the frequency of each word in the document, creating a vector representation.
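A small sketch using scikit-learn's CountVectorizer (the example documents are placeholders):

```python
# Bag-of-Words sketch: each document becomes a vector of word counts.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # learned vocabulary (word order is lost)
print(bow.toarray())                       # per-document word counts
```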

While BoW is a simple and interpretable representation, the disadvantages below highlight its limitations in capturing certain aspects of language structure and semantics:

  • BoW ignores the order of words in the document, leading to a loss of sequential information and context, which makes it less effective for tasks where word order is crucial, such as natural language understanding.
  • BoW representations are often sparse, with many elements being zero, resulting in increased memory requirements and computational inefficiency, especially when dealing with large datasets.

1.3. Term frequency-inverse document frequency (TF-IDF)

Term Frequency-Inverse Document Frequency , commonly known as TF-IDF, is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). It is widely used in natural language processing and information retrieval to evaluate the significance of a term within a specific document in a larger corpus. TF-IDF consists of two components:

  • Term Frequency (TF): Term Frequency measures how often a term (word) appears in a document. It is calculated using the formula:

\text{TF}(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}

  • Inverse Document Frequency (IDF): Inverse Document Frequency measures the importance of a term across a collection of documents. It is calculated using the formula:

\text{IDF}(t,D) = \log\left(\frac{\text{Total number of documents in corpus } D}{\text{Number of documents containing term } t}\right)

The TF-IDF score for a term t in a document d is then given by multiplying the TF and IDF values:

\text{TF-IDF}(t,d,D) = \text{TF}(t,d) \times \text{IDF}(t,D)

The higher the TF-IDF score for a term in a document, the more important that term is to that document within the context of the entire corpus. This weighting scheme helps in identifying and extracting relevant information from a large collection of documents, and it is commonly used in text mining, information retrieval, and document clustering.

Let’s implement TF-IDF in Python with the scikit-learn library. The example below defines a set of sample documents, uses TfidfVectorizer to transform them into a TF-IDF matrix, and then extracts and prints the TF-IDF value of each word in each document. This statistical measure helps assess the importance of words in a document relative to their frequency across a collection of documents, aiding in information retrieval and text analysis tasks.
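A sketch along the lines of that description (the sample documents are placeholders):

```python
# TF-IDF sketch: sample documents are turned into a TF-IDF matrix and the
# per-word scores are printed for each document.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The quick brown fox jumps over the lazy dog",
    "A quick brown dog outpaces a quick fox",
    "The dog sleeps all day",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
words = vectorizer.get_feature_names_out()

for doc_index, document in enumerate(documents):
    print(f"Document {doc_index + 1}: {document}")
    row = tfidf_matrix[doc_index]
    for word_index, score in zip(row.indices, row.data):
        print(f"  {words[word_index]}: {score:.3f}")
```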

TF-IDF is a widely used technique in information retrieval and text mining, but its limitations should be considered, especially when dealing with tasks that require a deeper understanding of language semantics. For example:

  • Lack of semantic understanding: TF-IDF treats words as independent entities and doesn’t consider semantic relationships between them. This limitation hinders its ability to capture contextual information and word meanings.
  • Sensitivity to Document Length: Longer documents tend to have higher overall term frequencies, potentially biasing TF-IDF towards longer documents.

2. Neural Approach

2.1. Word2Vec

Word2Vec is a neural approach for generating word embeddings. It belongs to the family of neural word embedding techniques and specifically falls under the category of distributed representation models. It is a popular technique in natural language processing (NLP) used to represent words as vectors in a continuous vector space. Developed by a team at Google, Word2Vec aims to capture the semantic relationships between words by mapping them to high-dimensional vectors. The underlying idea is that words with similar meanings should have similar vector representations. In Word2Vec every word is assigned a vector; we start with either a random vector or a one-hot vector.

There are two neural embedding methods for Word2Vec, Continuous Bag of Words (CBOW) and Skip-gram.

2.2. Continuous Bag of Words (CBOW)

Continuous Bag of Words (CBOW) is a type of neural network architecture used in the Word2Vec model. The primary objective of CBOW is to predict a target word based on its context, which consists of the surrounding words in a given window. Given a sequence of words in a context window, the model is trained to predict the target word at the center of the window.

CBOW is a feedforward neural network with a single hidden layer. The input layer represents the context words, and the output layer represents the target word. The hidden layer contains the learned continuous vector representations (word embeddings) of the input words.

The architecture is useful for learning distributed representations of words in a continuous vector space.
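A schematic forward pass for this architecture in NumPy; the dimensions, random weights, and simple averaging are illustrative only, since real implementations learn the weights by backpropagation:

```python
# Schematic CBOW forward pass: average the context word embeddings, then
# score every vocabulary word and apply softmax over the vocabulary.
import numpy as np

vocab_size, embedding_dim = 5000, 100
rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, embedding_dim))   # input-to-hidden weights (embeddings)
W_out = rng.normal(size=(embedding_dim, vocab_size))  # hidden-to-output weights

context_ids = np.array([12, 47, 503, 9])  # indices of the context words
h = W_in[context_ids].mean(axis=0)        # hidden layer: average of context embeddings

scores = h @ W_out                        # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                      # softmax over the vocabulary

predicted_center = int(np.argmax(probs))  # index of the predicted target word
```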


The hidden layer contains the continuous vector representations (word embeddings) of the input words.

  • The weights between the input layer and the hidden layer are learned during training.
  • The dimensionality of the hidden layer represents the size of the word embeddings (the continuous vector space).

2.3. Skip-Gram

The Skip-Gram model learns distributed representations of words in a continuous vector space. The main objective of Skip-Gram is to predict context words (words surrounding a target word) given a target word. This is the opposite of the Continuous Bag of Words (CBOW) model, where the objective is to predict the target word based on its context. It is shown that this method produces more meaningful embeddings.


After applying the above neural embedding methods we get trained vectors for each word after many iterations through the corpus. These trained vectors preserve syntactic and semantic information while having far fewer dimensions, and vectors with similar meaning or semantic content are placed close to each other in space.

Let’s understand this with a basic example. In the Python code below, the vector_size parameter controls the dimensionality of the word vectors, and you can adjust other parameters such as window based on your specific needs.
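A sketch consistent with that description, using gensim's bundled common_texts toy corpus as an example:

```python
# Word2Vec on gensim's small built-in corpus; vector_size sets the embedding
# dimensionality and window the context size, as noted above.
from gensim.models import Word2Vec
from gensim.test.utils import common_texts

cbow_model = Word2Vec(common_texts, vector_size=50, window=3, min_count=1, sg=0)
skipgram_model = Word2Vec(common_texts, vector_size=50, window=3, min_count=1, sg=1)

print(cbow_model.wv.most_similar("computer", topn=3))
print(skipgram_model.wv["computer"][:5])
```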

Note: Word2Vec models can perform better with larger datasets. If you have a large corpus, you might achieve more meaningful word embeddings.

In practice, the choice between CBOW and Skip-gram often depends on the specific characteristics of the data and the task at hand. CBOW might be preferred when training resources are limited, and capturing syntactic information is important. Skip-gram, on the other hand, might be chosen when semantic relationships and the representation of rare words are crucial.

3. Pretrained Word Embeddings

Pre-trained word embeddings are representations of words that are learned from large corpora and are made available for reuse in various natural language processing (NLP) tasks. These embeddings capture semantic relationships between words, allowing the model to understand similarities and relationships between different words in a meaningful way.

3.1. GloVe

GloVe is trained on global word co-occurrence statistics. It leverages global context to create word embeddings that reflect the overall meaning of words based on their co-occurrence probabilities. In this method, we iterate through the corpus and record the co-occurrence of each word with the other words, producing a co-occurrence matrix. Words that occur next to each other get a value of 1; if they are one word apart, 1/2; if two words apart, 1/3; and so on.

Let us take an example to understand how the matrix is created. We have a small corpus:

|         | it      | is      | a       | nice | evening | good |
|---------|---------|---------|---------|------|---------|------|
| it      | 0       |         |         |      |         |      |
| is      | 1+1     | 0       |         |      |         |      |
| a       | 1/2+1   | 1+1/2   | 0       |      |         |      |
| nice    | 1/3+1/2 | 1/2+1/3 | 1+1     | 0    |         |      |
| evening | 1/4+1/3 | 1/3+1/4 | 1/2+1/2 | 1+1  | 0       |      |
| good    | 0       | 0       | 0       | 0    | 1       | 0    |

The upper half of the matrix will be a reflection of the lower half. We can consider a window frame as well to calculate the co-occurrences by shifting the frame till the end of the corpus. This helps gather information about the context in which the word is used.
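A small sketch of the distance-weighted counting described above (the token list and window size are placeholders):

```python
# Build a distance-weighted co-occurrence table: a pair of words that are
# d positions apart contributes 1/d to their cell.
from collections import defaultdict

tokens = ["it", "is", "a", "nice", "evening"]  # placeholder corpus
window = 4

cooc = defaultdict(float)
for i, left_word in enumerate(tokens):
    for j in range(i + 1, min(i + window + 1, len(tokens))):
        distance = j - i
        pair = tuple(sorted((left_word, tokens[j])))
        cooc[pair] += 1.0 / distance

for (w1, w2), value in sorted(cooc.items()):
    print(f"{w1:8s} {w2:8s} {value:.2f}")
```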

Initially, the vectors for each word are assigned randomly. Then we take pairs of vectors and see how close they are to each other in space. If two words occur together more often (i.e. have a higher value in the co-occurrence matrix) but are far apart in space, their vectors are brought closer together. If they are close to each other but are rarely used together, they are moved further apart in space.

After many iterations of the above process, we get a vector space representation that approximates the information in the co-occurrence matrix. GloVe is often reported to perform better than Word2Vec at capturing both semantic and syntactic relationships.

3.2. FastText

Developed by Facebook, FastText extends Word2Vec by representing words as bags of character n-grams. This approach is particularly useful for handling out-of-vocabulary words and capturing morphological variations.

3.3. BERT (Bidirectional Encoder Representations from Transformers)

BERT is a transformer-based model that learns contextualized embeddings for words. It considers the entire context of a word by considering both left and right contexts, resulting in embeddings that capture rich contextual information.
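A short sketch of extracting contextual embeddings with the Hugging Face transformers library and the bert-base-uncased checkpoint; neither is named in the text above, so treat them as one common choice rather than the method described here.

```python
# Contextual embeddings from BERT: each token's vector depends on the
# surrounding sentence, unlike static Word2Vec/GloVe vectors.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised its interest rates", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state[0]  # shape: (num_tokens, hidden_size)
print(token_embeddings.shape)
```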

   

Considerations for Deploying Word Embedding Models

  • You need to use the exact same preprocessing pipeline when deploying your model as was used to create the training data for the word embedding. If you use a different tokenizer or a different method of handling white space, punctuation, etc., you might end up with incompatible inputs.
  • Some words in your input may not have a pre-trained vector. Such words are known as Out-of-Vocabulary (OOV) words. You can replace them with an "UNK" (unknown) token and handle them separately.
  • Dimension mismatch: vectors can come in many lengths. If you train a model with vectors of length, say, 400 and then try to apply vectors of length 1000 at inference time, you will run into errors, so make sure to use the same dimensions throughout.

Advantages and Disadvantages of Word Embeddings

Advantages

  • It is much faster to train than hand-built models like WordNet (which uses graph embeddings).
  • Almost all modern NLP applications start with an embedding layer.
  • It stores an approximation of meaning.

Disadvantages

  • It can be memory intensive.
  • It is corpus dependent. Any underlying bias will have an effect on your model.
  • It cannot distinguish between homophones, e.g. brake/break, cell/sell, weather/whether, etc.

In conclusion, word embedding techniques such as TF-IDF, Word2Vec, and GloVe play a crucial role in natural language processing by representing words in a lower-dimensional space, capturing semantic and syntactic information.

Frequently Asked Questions (FAQs)

1. Does GPT use word embeddings?

GPT uses context-based embeddings rather than traditional word embeddings. It captures word meaning in the context of the entire sentence.

2. What is the difference between BERT and word embeddings?

BERT is contextually aware, considering the entire sentence, while traditional word embeddings, like Word2Vec, treat each word independently.

3. What are the two types of word embedding?

Word embeddings can be broadly evaluated in two categories: intrinsic and extrinsic. For intrinsic evaluation, word embeddings are used to calculate or predict semantic similarity between words, terms, or sentences.

4. How does word vectorization work?

Word vectorization converts words into numerical vectors, capturing semantic relationships. Techniques like TF-IDF, Word2Vec, and GloVe are common.

5. What are the benefits of word embeddings?

Word embeddings offer semantic understanding, capture context, and enhance NLP tasks. They reduce dimensionality, speed up training, and aid in language pattern recognition.



yoongtr/Coursera---Natural-Language-Processing-specialization

Natural Language Processing Specialization

Notes, Assignments and Relevant stuff from NLP course by deeplearning.ai, hosted on Coursera.

Course 1: Natural Language Processing with Classification and Vector Spaces

Week 1: Sentiment Analysis with Logistic Regression

  • Natural Language Preprocessing
  • Visualizing Word Frequencies
  • Visualizing Tweets and Logistic Regression models
  • Assignment 1

Week 2: Sentiment Analysis with Naive Bayes

  • Visualizing likelihoods and confidence ellipses
  • Assignment 2

Week 3: Vector Space Models

  • Linear algebra in Python with Numpy
  • Manipulating word embeddings
  • Another explanation about PCA
  • Assignment 3

Week 4: Machine Translation and Document Search

  • Rotation matrices in L2
  • Hash tables
  • Assignment 4

Course 2: Natural Language Processing with Probabilistic Models

Week 1: Autocorrect

  • Building the vocabulary
  • Candidates from edits

Week 2: Part of Speech Tagging and Hidden Markov Models

  • Parts-of-Speech Tagging - First Steps: Working with text files, Creating a Vocabulary and Handling Unknown Words
  • Parts-of-Speech Tagging - Working with tags and Numpy

Week 3: Autocomplete and Language Models

  • N-grams Corpus preprocessing
  • Building the language model
  • Out of vocabulary words (OOV)

Week 4: Word embeddings with Neural Networks

  • Word Embeddings: Ungraded Practice Notebook
  • Word Embeddings First Steps: Data Preparation
  • Word Embeddings: Intro to CBOW model, activation functions and working with Numpy
  • Word Embeddings: Training the CBOW model
  • Word Embeddings: Hands On

Course 3: Natural Language Processing with Sequence Models

Week 1: Neural Network for Sentiment Analysis

  • Introduction to Trax
  • Classes and Subclasses
  • Data Generators

Week 2: Recurrent Neural Networks for Language Modeling

  • Hidden State Activation
  • Working with JAX NumPy and Calculating Perplexity
  • Vanilla RNNs, GRUs and the scan function
  • Creating a GRU model using Trax

Week 3: LSTM and Name Entity Recognition

  • Vanishing Gradient

Week 4: Siamese Networks

  • Creating a Siamese Model using Trax
  • Modified Triplet Loss

Course 4: Natural Language Processing with Attention Models

Week 1: Neural Machine Translation

  • Stack Semantics

Week 2: Text Summarization

  • The Transformer Decoder

Week 3: Question Answering

  • SentencePiece and BPE

Week 4: Chatbot

  • Reformer LSH
