PUBL0099: Quantitative Text Analysis for Social Science

Author

Jack Blumenau

Course Information

This is the course website for PUBL0099: Quantitative Text Analysis for Social Science, delivered by the Department of Political Science at UCL.

Key information

  • Module Name: Quantitative Text Analysis for Social Science
  • Module Code: PUBL0099
  • Lecturers:
  • Student Support and Feedback Hours:
    • Jack: Tuesdays, 9-11am, book here.
    • Michael: XX, XXam, weeks 8 and 9, book here.
  • Credits: 15
  • Assessment Method: 3000-word research essay
  • Assessment Deadline: First week of term 3.

Course Description

Some of the most interesting and important concepts in the social sciences are observable predominantly (or sometimes even exclusively) in written form. This is because much of social life occurs through the language that we use: laws are written; speeches are spoken; historical events are transcribed; correspondence is shared; and so on. Historically, quantitative social scientists have spent relatively little time analysing the texts produced as the output of these social processes, for two main reasons. First, for many research questions focused on social texts, researchers only had access to relatively small numbers of documents, meaning that qualitative approaches were both attractive and feasible for such analyses. Second, when large collections of documents were available, analysing those texts tended to be expensive and technically challenging.

In recent years, however, the volume of text-based data available to social scientists has proliferated at an extraordinary rate, largely thanks to the huge collections of texts that have been made available online. The increasing availability of digitized texts has also prompted social scientists to develop a wide array of new methods that can be used to analyse and (we hope) extract meaning from those texts. These two changes – the increasing availability of digitized text data, and the rapid pace of methodological development – have provoked an enormous amount of research in the field of “quantitative text analysis”, or “text-as-data”.

This course provides an overview of text-as-data methods for social science students. The key goals of the course are: to introduce the foundational models and approaches used to analyse large-scale collections of texts in modern social science; to develop students’ abilities to critically evaluate existing text-as-data work in the discipline; and to provide the practical skills required to conduct an original research project which uses quantitative text analysis methods.

Throughout the course, we will think deeply about the things we can (and cannot) measure reliably through the quantitative analysis of text; discover how treating text as data necessitates making assumptions, which can be consequential; learn tools to collect and manipulate large collections of text; and develop a suite of practical computational skills to apply text-as-data analyses to data of widely varying forms. Along the way, we will cover a wide variety of topics and examples from political science, economics, and public policy.

Pre-requisites

Students should have a working knowledge of the methods covered in typical introductory quantitative methods courses (i.e. to the level of PUBL0055 or equivalent). At a minimum, this should include hypothesis testing and multiple linear regression. You will need to provide me with evidence of having completed at least one prior course that covers this material.

Students who have not taken PUBL0055 earlier in the year may wish to refresh their knowledge before starting this course. A good resource in this regard is the following:

  • Imai, Kosuke. 2017. Quantitative Social Science: An Introduction, Princeton University Press.

Learning, Assessment and Feedback

Learning Outcomes

By the end of the course, students should be able to:

  1. Understand the foundational models and approaches used to analyse large-scale collections of texts in modern social science
  2. Critically evaluate existing text-as-data work in the social sciences
  3. Understand, apply and interpret a variety of quantitative methods for text data

These outcomes are indicative of the kinds of knowledge that should be demonstrated on the summative assessment.

Teaching Format

Teaching delivery will be split into lectures and seminars. Note that, in addition to the below, I will hold student support and feedback hours each week where you will be able to ask additional questions.

  1. Lectures

All of the main course content will be delivered in weekly 2-hour lectures. You are expected to attend all lectures.

  2. Seminars

This is a practical module, and a key learning objective is for students to be able to apply the statistical methods we cover during lectures to real data. Each week, you will complete a problem set which involves writing code in the R programming language (see below for more details) and interpreting the results.

The goal of these seminars is to provide you with ample time to ask questions about the problem set and about particular issues that relate to coding in R. During your allocated seminar time, you will be able to ask questions of the teacher; speak with other students about the problem set; and watch short live demonstrations from your seminar teacher. You will also be able to use Moodle to log questions for teaching staff or other students to answer outside of those allocated hours. Attendance during these seminar hours is mandatory and we will take a register at the beginning of the session.

Please note that we expect you to have made some attempt to answer the questions in the seminar materials before attending the seminar each week. This will make the seminars themselves much more productive.

Assessment

Students will be evaluated through a 3000-word essay applying the methods from the course to a research question chosen by the student.

  • Part 1 (30%): a 1000-word review of an existing text-as-data application in a social science literature of your choice.

  • Part 2 (70%): a 2000-word research paper. For UG students, this involves formulating a research question, describing the methodological approach you will use, and identifying a suitable corpus on which the analysis would be conducted. PG students will do all of the above but will also analyse the text data that they collect in order to answer their research question.

More detail on both parts of the assessment can be found on the Assessment Guidelines page of the website.

For formative work, students will also complete short “homework” assignments each week, which allow them to apply material from the course to concrete examples. These formative assessments provide an opportunity for peer and instructor feedback and play a small role in determining your final grade. Students who submit at least 8 homework assignments will be granted 10 extra points in the final mark.

All assignments will be available on the course website and annotated solutions will be released (also via the course website) on the next lecture day. Students will also be asked to submit their completed assignments before the next lecture day, so that common problems can be discussed in the lectures and seminars.

Course Materials

Online resources

  • Course website: The main source of information for lecture recordings, lecture notes, quizzes, problem sets, and readings will be the course website.
  • Moodle: Other material relevant to the course will be uploaded to the course Moodle site.

Textbooks and Readings

A full reading list for each week of the module can be found on the Schedule page of the website.

We primarily use the following textbook on this course:

  1. Text As Data: A New Framework for Machine Learning and the Social Sciences, Justin Grimmer, Margaret E. Roberts and Brandon M. Stewart, Princeton University Press, 2022

Students should read the articles set as “required” reading each week, and it is also worth familiarising yourself with at least some of the “recommended” reading. The required reading will often contain material that is not covered in the textbook, partly because the methods on this course are at the cutting edge of the discipline and so are (sometimes) too new to have received coverage in textbooks, and partly because it is (often) more interesting to read the papers than the book.

The “recommended” readings will typically cover recent or important implementations of the methods we will learn about, and will be helpful in (at least) two regards. First, reading these articles will provide you with an understanding of when the methods we study can provide interesting answers to previously thorny empirical questions. Second, these articles will be helpful templates for the research and review papers that you will write at the conclusion of the course.

Software

Throughout the course we will use the free and open source statistical analysis software R. Before the course starts, you can and should download and install R on your personal computer. You should also download and install RStudio, which is a user interface for R. Please ensure that both R and RStudio are installed on your personal computer before the first lecture. UCL machines, either virtual via Desktop@UCL or on campus, will already have this software installed.

Students are not expected to have programming knowledge before starting class, and the computer labs will be centered around assignments which will help build knowledge of and intuition for coding in R. We will, however, move relatively fast in terms of programming on this course, so if this is an area where you have less experience then it might be worth doing some preparatory work first. Feel free to get in touch with Jack to ask for suggestions.

You will be provided with all the relevant code necessary for completing the class assignments and problem sets each week (you will also be provided with solution code for the problem sets). That said, what you get out of your experience with R in this course really will be a function of what you put into learning it. With that in mind, I’d recommend that everyone who is serious about doing well on the course spend at least a little time familiarising yourself with R in advance.

Academic Freedom

Academic freedom is the cornerstone of university research and teaching, so that all university staff, speakers, and students can freely explore questions and ideas and challenge perceived views and opinions, without being censored or harassed by a government, any state authorities, the University, other students, or external pressure groups. As part of the UCL academic community, all staff, speakers, and students share these responsibilities:

  • Everyone must respect freedom of thought and freedom of expression. Your lecturer will not limit what can be discussed in the seminar, as long as it is relevant to the subject. They will not censor any topics, and they will expose you to controversial issues, questions, facts, views, and debates.

    • You may disagree with some facts or views that you read or hear in the classroom. You are encouraged to engage with these facts and views in a respectful manner.
    • Your lecturer will not penalise you merely for expressing views they or other students disagree with. However, they will expect you to present logical arguments supported by evidence.
  • You are explicitly prohibited from recording, publishing, distributing or transferring any class material/content, in whole or in part, in any format, to any individual or entity outside the module, linking to or posting it online (including social media), or making it otherwise available to any person or entity outside the module, unless you have received prior specific written approval from the module leader. You are also explicitly prohibited from aiding or abetting in any of these actions. Similarly, your lecturer will not record, publish or distribute seminar sessions without the explicit consent of the participants.

  • By agreeing to take this module, you agree to abide by these terms. If you do not comply with these terms, you will potentially be subject to disciplinary actions similar to those under violations of the university Student Code of Conduct.

Use of Artificial Intelligence

All research and writing for this module must be produced by the student. If you have another person or an AI do even parts of the research or the writing for you, this is considered academic misconduct. Any suspected academic misconduct will be investigated and can ultimately lead to module failure.

In this module, you are authorised to seek assistance from AI tools to edit your writing. Editing includes fixing typos and grammatical errors, clarifying the message of a sentence or paragraph, or cutting down text. If you choose to use AI tools to edit your writing, acknowledge your use by naming the AI tool and describing how you used it in a footnote. You should also acknowledge if another person edited your writing.