8 Word Embeddings
Datathon 3
The final datathon exercise is due in two weeks.
8.1 Seminar
8.1.1 Exercise
8.1.1.1 With Quanteda
Updating Quanteda
You might run into problems creating a term co-occurrence matrix with the version of quanteda currently on CRAN. It is recommended to update to the latest version from GitHub:
devtools::install_github("kbenoit/quanteda")
library(text2vec)
library(quanteda)
library(readtext)
Download the text8 dataset, a cleaned-up extract (approximately 100MB uncompressed) of the English-language Wikipedia dump from Mar 3, 2006. You can read more about the dataset here: http://mattmahoney.net/dc/textdata.html
DATA_DIR <- "~/PUBLG088/data"
if (!dir.exists(DATA_DIR)) {
  dir.create(DATA_DIR, recursive = TRUE)
}
text8_file <- file.path(DATA_DIR, "text8")
text8_zipfile <- paste0(text8_file, ".zip")
# download and unzip only if the extracted file is not already present
# (mode = "wb" ensures the zip file is written correctly on Windows)
if (!file.exists(text8_file)) {
  download.file("http://mattmahoney.net/dc/text8.zip", text8_zipfile, mode = "wb")
  unzip(text8_zipfile, exdir = DATA_DIR)
}
Load the file using readtext and create a quanteda corpus.
wiki_text <- readtext(text8_file)
wiki_corpus <- corpus(wiki_text)
Create a document feature matrix from the wikipedia text.
wiki_dfm <- dfm(wiki_corpus)
wiki_dfm
Document-feature matrix of: 1 document, 253,853 features (0% sparse).
Trim the vocabulary to only include words that appear at least 5 times.
wiki_vocab <- featnames(dfm_trim(wiki_dfm, min_count = 5))
length(wiki_vocab)
[1] 71290
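As a quick sanity check on the vocabulary, you can list the most frequent features in the document-feature matrix with quanteda's topfeatures(). This is an optional step, not required for the exercise:
topfeatures(wiki_dfm, 10)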
Select the tokens that are in the vocabulary.
wiki_tokens <- tokens(wiki_corpus)
wiki_tokens <- tokens_select(wiki_tokens, wiki_vocab, padding = TRUE)
Create a term co-occurrence matrix.
tcm <- fcm(wiki_tokens, context = "window", count = "weighted", weights = 1/(1:5))
tcm
Feature co-occurrence matrix of: 71,290 by 71,290 features.
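As an optional check, you can pull out a small sub-matrix of the weighted co-occurrence counts for a few words with fcm_select(). The words below are just illustrative examples:
fcm_select(tcm, pattern = c("paris", "france", "germany", "berlin"))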
If tcm is created successfully with quanteda, you can skip ahead to the Running GloVe section.
8.1.1.2 Without Quanteda
If you encounter problems creating the term co-occurrence matrix with quanteda, you can use this method instead:
Now load the data with readLines:
wiki_text <- readLines(text8_file, n = 1, warn = FALSE)
Tokenize by whitespace.
tokens <- space_tokenizer(wiki_text)
Create vocabulary. Terms will be unigrams (simple words).
token_iterator <- itoken(tokens, progressbar = FALSE)
wiki_vocab <- create_vocabulary(token_iterator)
wiki_vocab <- prune_vocabulary(wiki_vocab, term_count_min = 5L)
Use our filtered vocabulary to create the term co-occurrence matrix.
# skip_grams_window = 5 matches the weighted 5-word window used in the quanteda fcm() call above
tcm <- create_tcm(token_iterator, vocab_vectorizer(wiki_vocab), skip_grams_window = 5L)
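Before moving on, you can confirm that this produces a term co-occurrence matrix of the same shape as the one built with quanteda (71,290 by 71,290):
dim(tcm)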
8.1.2 Running GloVe
Now we can run GloVe. Be patient, as it will take several minutes to run.
glove <- GlobalVectors$new(word_vectors_size = 50, vocabulary = wiki_vocab, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 20)
INFO [2017-12-20 23:42:46] 2017-12-20 23:42:46 - epoch 1, expected cost 0.0806
INFO [2017-12-20 23:43:21] 2017-12-20 23:43:21 - epoch 2, expected cost 0.0618
INFO [2017-12-20 23:43:56] 2017-12-20 23:43:56 - epoch 3, expected cost 0.0539
INFO [2017-12-20 23:44:31] 2017-12-20 23:44:31 - epoch 4, expected cost 0.0498
INFO [2017-12-20 23:45:07] 2017-12-20 23:45:07 - epoch 5, expected cost 0.0473
INFO [2017-12-20 23:45:42] 2017-12-20 23:45:42 - epoch 6, expected cost 0.0456
INFO [2017-12-20 23:46:17] 2017-12-20 23:46:17 - epoch 7, expected cost 0.0443
INFO [2017-12-20 23:46:53] 2017-12-20 23:46:53 - epoch 8, expected cost 0.0432
INFO [2017-12-20 23:47:28] 2017-12-20 23:47:28 - epoch 9, expected cost 0.0424
INFO [2017-12-20 23:48:03] 2017-12-20 23:48:03 - epoch 10, expected cost 0.0417
INFO [2017-12-20 23:48:36] 2017-12-20 23:48:36 - epoch 11, expected cost 0.0411
INFO [2017-12-20 23:49:10] 2017-12-20 23:49:10 - epoch 12, expected cost 0.0406
INFO [2017-12-20 23:49:44] 2017-12-20 23:49:44 - epoch 13, expected cost 0.0401
INFO [2017-12-20 23:50:17] 2017-12-20 23:50:17 - epoch 14, expected cost 0.0397
INFO [2017-12-20 23:50:52] 2017-12-20 23:50:52 - epoch 15, expected cost 0.0394
INFO [2017-12-20 23:51:26] 2017-12-20 23:51:26 - epoch 16, expected cost 0.0391
INFO [2017-12-20 23:52:00] 2017-12-20 23:52:00 - epoch 17, expected cost 0.0388
INFO [2017-12-20 23:52:34] 2017-12-20 23:52:34 - epoch 18, expected cost 0.0385
INFO [2017-12-20 23:53:09] 2017-12-20 23:53:09 - epoch 19, expected cost 0.0383
INFO [2017-12-20 23:53:43] 2017-12-20 23:53:43 - epoch 20, expected cost 0.0381
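If you do not want to wait for all 20 epochs, fit_transform() also accepts a convergence_tol argument, which stops training once the relative improvement in cost between epochs falls below the tolerance. A minimal sketch, assuming the same glove object and tcm created above:
# stop early once the relative improvement between epochs drops below 1%
wv_main <- glove$fit_transform(tcm, n_iter = 20, convergence_tol = 0.01)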
Check the dimensions of the transformed matrix returned by GloVe.
dim(wv_main)
[1] 71290 50
Extract the context word vectors from the glove object.
wv_context <- glove$components
dim(wv_context)
[1] 50 71290
We can take either the sum or the average of the main and context vectors; here we use the sum.
word_vectors <- wv_main + t(wv_context)
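If you prefer the average rather than the sum, divide the same quantity by two (word_vectors_avg below is just an illustrative name; the rest of the exercise uses the summed word_vectors):
# average of the main and context vectors, an alternative to the sum above
word_vectors_avg <- (wv_main + t(wv_context)) / 2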
Now check the closest terms to our paris - france + germany analogy example:
word_analogy <- word_vectors["paris", , drop = FALSE] -
word_vectors["france", , drop = FALSE] +
word_vectors["germany", , drop = FALSE]
cos_sim <- sim2(x = word_vectors,
y = word_analogy,
method = "cosine",
norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE))
berlin munich paris germany vienna venice
0.7458732 0.7296414 0.6953966 0.6906732 0.6423478 0.6253808
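The same sim2() call also works for simple nearest-neighbour queries. For example, to list the words closest to france (an illustrative extra query, not part of the original analogy example):
france_sim <- sim2(x = word_vectors,
                   y = word_vectors["france", , drop = FALSE],
                   method = "cosine",
                   norm = "l2")
head(sort(france_sim[, 1], decreasing = TRUE))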