8 Word Embeddings
Datathon 3
The final datathon exercise is due in two weeks.
8.1 Seminar
8.1.1 Exercise
8.1.1.1 With Quanteda
Updating Quanteda
You might run into problems creating a term co-occurrence matrix with the version of quanteda currently on CRAN. It is recommended to update to the latest version from GitHub:
devtools::install_github("kbenoit/quanteda")
library(text2vec)
library(quanteda)
library(readtext)
Download the text8 dataset, a cleaned-up extract (approximately 100MB uncompressed) of the English-language Wikipedia dump from Mar 3, 2006. You can read more about the dataset here: http://mattmahoney.net/dc/textdata.html
DATA_DIR <- "~/PUBLG088/data"
if (!dir.exists(DATA_DIR)) {
  dir.create(DATA_DIR, recursive = TRUE)
}
text8_file <- file.path(DATA_DIR, "text8")
text8_zipfile <- paste0(text8_file, ".zip")
# download and unzip only if the extracted file is not already present
# (mode = "wb" ensures the zip file is written correctly on Windows)
if (!file.exists(text8_file)) {
  download.file("http://mattmahoney.net/dc/text8.zip", text8_zipfile, mode = "wb")
  unzip(text8_zipfile, exdir = DATA_DIR)
}
Load the file using readtext and create a quanteda corpus.
wiki_text <- readtext(text8_file)
wiki_corpus <- corpus(wiki_text)
Create a document feature matrix from the wikipedia text.
wiki_dfm <- dfm(wiki_corpus)
wiki_dfm
Document-feature matrix of: 1 document, 253,853 features (0% sparse).
Trim the vocabulary to only include words that appear at least 5 times.
wiki_vocab <- featnames(dfm_trim(wiki_dfm, min_count = 5))
length(wiki_vocab)
[1] 71290
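As a quick sanity check on the vocabulary, you can list the most frequent features in the document-feature matrix with quanteda's topfeatures(). This is an optional step, not required for the exercise:
topfeatures(wiki_dfm, 10)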
Select the tokens that are in the vocabulary.
wiki_tokens <- tokens(wiki_corpus)
wiki_tokens <- tokens_select(wiki_tokens, wiki_vocab, padding = TRUE)
Create a term co-occurrence matrix.
tcm <- fcm(wiki_tokens, context = "window", count = "weighted", weights = 1/(1:5))
tcm
Feature co-occurrence matrix of: 71,290 by 71,290 features.
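As an optional check, you can pull out a small sub-matrix of the weighted co-occurrence counts for a few words with fcm_select(). The words below are just illustrative examples:
fcm_select(tcm, pattern = c("paris", "france", "germany", "berlin"))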
If tcm is created successfully with quanteda, you can skip ahead to the Running GloVe section.
8.1.1.2 Without Quanteda
If you encounter problems creating the term co-occurrence matrix with quanteda, you can use this method instead:
Now load the data with readLines:
wiki_text <- readLines(text8_file, n = 1, warn = FALSE)
Tokenize by whitespace.
tokens <- space_tokenizer(wiki_text)
Create vocabulary. Terms will be unigrams (simple words).
token_iterator <- itoken(tokens, progressbar = FALSE)
wiki_vocab <- create_vocabulary(token_iterator)
wiki_vocab <- prune_vocabulary(wiki_vocab, term_count_min = 5L)
Use our filtered vocabulary to create the term co-occurrence matrix.
# skip_grams_window = 5 matches the weighted 5-word window used in the quanteda fcm() call above
tcm <- create_tcm(token_iterator, vocab_vectorizer(wiki_vocab), skip_grams_window = 5L)
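Before moving on, you can confirm that this produces a term co-occurrence matrix of the same shape as the one built with quanteda (71,290 by 71,290):
dim(tcm)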
8.1.2 Running GloVe
Now we can run GloVe. Be patient, as it will take several minutes to run.
glove <- GlobalVectors$new(word_vectors_size = 50, vocabulary = wiki_vocab, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 20)
INFO [2017-12-20 23:42:46] 2017-12-20 23:42:46 - epoch 1, expected cost 0.0806
INFO [2017-12-20 23:43:21] 2017-12-20 23:43:21 - epoch 2, expected cost 0.0618
INFO [2017-12-20 23:43:56] 2017-12-20 23:43:56 - epoch 3, expected cost 0.0539
INFO [2017-12-20 23:44:31] 2017-12-20 23:44:31 - epoch 4, expected cost 0.0498
INFO [2017-12-20 23:45:07] 2017-12-20 23:45:07 - epoch 5, expected cost 0.0473
INFO [2017-12-20 23:45:42] 2017-12-20 23:45:42 - epoch 6, expected cost 0.0456
INFO [2017-12-20 23:46:17] 2017-12-20 23:46:17 - epoch 7, expected cost 0.0443
INFO [2017-12-20 23:46:53] 2017-12-20 23:46:53 - epoch 8, expected cost 0.0432
INFO [2017-12-20 23:47:28] 2017-12-20 23:47:28 - epoch 9, expected cost 0.0424
INFO [2017-12-20 23:48:03] 2017-12-20 23:48:03 - epoch 10, expected cost 0.0417
INFO [2017-12-20 23:48:36] 2017-12-20 23:48:36 - epoch 11, expected cost 0.0411
INFO [2017-12-20 23:49:10] 2017-12-20 23:49:10 - epoch 12, expected cost 0.0406
INFO [2017-12-20 23:49:44] 2017-12-20 23:49:44 - epoch 13, expected cost 0.0401
INFO [2017-12-20 23:50:17] 2017-12-20 23:50:17 - epoch 14, expected cost 0.0397
INFO [2017-12-20 23:50:52] 2017-12-20 23:50:52 - epoch 15, expected cost 0.0394
INFO [2017-12-20 23:51:26] 2017-12-20 23:51:26 - epoch 16, expected cost 0.0391
INFO [2017-12-20 23:52:00] 2017-12-20 23:52:00 - epoch 17, expected cost 0.0388
INFO [2017-12-20 23:52:34] 2017-12-20 23:52:34 - epoch 18, expected cost 0.0385
INFO [2017-12-20 23:53:09] 2017-12-20 23:53:09 - epoch 19, expected cost 0.0383
INFO [2017-12-20 23:53:43] 2017-12-20 23:53:43 - epoch 20, expected cost 0.0381
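If you do not want to wait for all 20 epochs, fit_transform() also accepts a convergence_tol argument, which stops training once the relative improvement in cost between epochs falls below the tolerance. A minimal sketch, assuming the same glove object and tcm created above:
# stop early once the relative improvement between epochs drops below 1%
wv_main <- glove$fit_transform(tcm, n_iter = 20, convergence_tol = 0.01)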
Check the dimensions of the transformed matrix returned by GloVe.
dim(wv_main)
[1] 71290 50
Extract the context word vectors from the glove object.
wv_context <- glove$components
dim(wv_context)
[1] 50 71290
We can take either the sum or the average of the main and context vectors; here we use the sum.
word_vectors <- wv_main + t(wv_context)
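If you prefer the average rather than the sum, divide the same quantity by two (word_vectors_avg below is just an illustrative name; the rest of the exercise uses the summed word_vectors):
# average of the main and context vectors, an alternative to the sum above
word_vectors_avg <- (wv_main + t(wv_context)) / 2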
Now check the closest terms to our paris - france + germany analogy example:
word_analogy <- word_vectors["paris", , drop = FALSE] -
word_vectors["france", , drop = FALSE] +
word_vectors["germany", , drop = FALSE]
cos_sim <- sim2(x = word_vectors,
y = word_analogy,
method = "cosine",
norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE))
berlin munich paris germany vienna venice
0.7458732 0.7296414 0.6953966 0.6906732 0.6423478 0.6253808
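The same sim2() call also works for simple nearest-neighbour queries. For example, to list the words closest to france (an illustrative extra query, not part of the original analogy example):
france_sim <- sim2(x = word_vectors,
                   y = word_vectors["france", , drop = FALSE],
                   method = "cosine",
                   norm = "l2")
head(sort(france_sim[, 1], decreasing = TRUE))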