Datathon 3

This is the second practice exercise that follows the overall structure of the final assessment for the course. It is a formative assessment designed to help you develop the expertise you will need for the final, summative assessment.

You will have two weeks to complete the exercise. You will present your results during the seminars on November 28.

Groups

You are randomly assigned to a group for each datathon exercise. Click on the Groups button at the bottom to find out which group you’re assigned to and exchange contact information with your teammates.

Dataset

For the datathon you will use the UN General Debates Corpus. You need to download the file UNGDC 1970-2016.zip. Do not download the PDFs unless you want to read the sessions in the original format.

You can read more about the data (and some applications) here: https://arxiv.org/abs/1707.02774. Using the UNGD corpus and any other data source you deem relevant, you are asked to address any issue of interest to you that involves extracting insights from data. Methodologically, you can use any of the methods we have covered in the course so far. Additionally, practice merging and cleaning datasets using the tools in the tidyverse (http://tidyverse.org).
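As a rough illustration of the kind of merging and cleaning mentioned above, here is a minimal sketch using dplyr. Everything in it is a toy assumption: the `speeches` and `covariates` tables, the `iso3c`, `year` and `score` column names, and the placeholder values are hypothetical and do not come from the UNGD corpus or any real data source.

```r
library(dplyr)

# Toy speech metadata; in practice this could come from docvars(ungd_corpus).
# Note the deliberately messy country code in the second row.
speeches <- tibble(
  Country = c("USA", "gbr ", "FRA"),
  Session = c(70L, 70L, 71L),
  Year    = c(2015L, 2015L, 2016L)
)

# Hypothetical country-year covariate with made-up placeholder values
covariates <- tibble(
  iso3c = c("USA", "GBR"),
  year  = c(2015L, 2015L),
  score = c(1.0, 2.0)
)

merged <- speeches %>%
  # Clean the join key: strip whitespace and standardise case
  mutate(Country = toupper(trimws(Country))) %>%
  # Keep every speech, attaching covariates where a match exists
  left_join(covariates, by = c("Country" = "iso3c", "Year" = "year"))

# Rows without a match keep NA in `score`; counting them is a quick merge check
n_unmatched <- sum(is.na(merged$score))
n_unmatched
```

A left join keeps every speech even when no covariate is available, so counting the unmatched rows is a cheap way to spot key mismatches before any analysis.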

For practical aspects of working with unstructured data in R, I suggest you work through:

For help with tidyverse, you can work through the exercises in this tutorial:

Loading UNGD Corpus in R

library(readtext)
library(quanteda)

Load a small sample (only 10 speeches) of the UNGD dataset:

ungd_debates <- readtext("https://uclspp.github.io/datasets/data/ungd_sample10.zip",
                         docvarsfrom = "filenames",
                         dvsep = "_",
                         docvarnames = c("Country", "Session", "Year"))
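
To see what docvarsfrom = "filenames" with dvsep = "_" is doing conceptually, here is a small base R sketch. It is not readtext's actual implementation; the file names are hypothetical examples that assume the Country_Session_Year naming pattern, and `docvars_sketch` is a name made up for this illustration.

```r
# Hypothetical file names following the Country_Session_Year pattern
files <- c("USA_70_2015.txt", "GBR_71_2016.txt")

# Drop the extension, then split on the same separator passed as dvsep
parts <- strsplit(tools::file_path_sans_ext(files), "_", fixed = TRUE)

# Assemble one row of document variables per file
docvars_sketch <- data.frame(
  Country = vapply(parts, `[`, character(1), 1),
  Session = as.integer(vapply(parts, `[`, character(1), 2)),
  Year    = as.integer(vapply(parts, `[`, character(1), 3))
)

docvars_sketch
```

Each piece of the file name becomes one document variable, in the order given by docvarnames.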

ungd_corpus <- corpus(ungd_debates)

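# Recent quanteda versions expect tokens, not a corpus, as input to dfm()
ungd_tokens <- tokens(ungd_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE)
ungd_tokens <- tokens_remove(ungd_tokens, stopwords("english"))
ungd_tokens <- tokens_wordstem(ungd_tokens)

# Exact feature counts can vary slightly with the quanteda version
ungd_dfm <- dfm(ungd_tokens)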

ungd_dfm
Document-feature matrix of: 10 documents, 2,672 features (76.6% sparse).
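
Once you have a document-feature matrix, a natural next step is to look at the most frequent features and to aggregate documents by a document variable. The sketch below does this on a tiny toy corpus so that it runs without downloading anything. The three texts and their docvars are invented for illustration and are not taken from the UNGD corpus; the same calls should apply to ungd_dfm.

```r
library(quanteda)

# Toy texts standing in for speeches (invented, not from the UNGD corpus)
toy_texts <- c(
  doc1 = "Peace and security require cooperation among nations.",
  doc2 = "Development and peace depend on international cooperation.",
  doc3 = "Climate change threatens development and security."
)

toy_corpus <- corpus(
  toy_texts,
  docvars = data.frame(Country = c("AAA", "BBB", "AAA"),
                       Year    = c(2015L, 2015L, 2016L))
)

# Tokenise, drop stopwords, stem, then build the document-feature matrix
toy_tokens <- tokens(toy_corpus, remove_punct = TRUE, remove_numbers = TRUE)
toy_tokens <- tokens_remove(toy_tokens, stopwords("english"))
toy_tokens <- tokens_wordstem(toy_tokens)
toy_dfm <- dfm(toy_tokens)

# Most frequent stemmed features across all documents
topfeatures(toy_dfm, 5)

# Collapse documents into one row per year by summing feature counts
by_year <- dfm_group(toy_dfm, groups = docvars(toy_dfm, "Year"))
ndoc(by_year)
```

Grouping by Year (or Country) gives one row per group, which is often a more convenient unit for comparing vocabulary over time or across countries.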