Topic models are generative models that can assign probabilities to text.
Topic models allow us to cluster similar documents in a corpus together.
Don’t we already have tools for that?
Yes! Dictionaries and supervised learning.
So what do topic models add?
Topic models offer an automated procedure for discovering the main “themes” in an unstructured corpus
They require no prior information, training set, or labeling of texts before estimation
They allow us to automatically organise, understand, and summarise large archives of text data.
Latent Dirichlet Allocation (LDA) is the most common approach (Blei et al., 2003), and one that underpins more complex models
Topic models are an example of mixture models:
(Figure: contrasting illustrations — "Not a mixture model", where each document belongs to a single cluster, vs "Mixture model", where documents can have partial membership in several clusters.)
Last week, we introduced the idea of a probabilistic language model
A language model is represented by a probability distribution over words in a vocabulary
The Naive Bayes text classification model is one example of a generative language model, where each category of document is associated with its own distribution over words
Topic models are also language models
Where Naive Bayes must reach a single decision (e.g. Politics ✓), a topic model can say that the document is a mixture of all three!
A “topic” is a probability distribution over a fixed word vocabulary.
In other words, a topic is a ‘word frequency profile’: it tells you which words are typical (high probability) and which are unusual (low probability) for that topic
Consider a vocabulary: gene, dna, genetic, data, number, computer
When speaking about genetics, you will use words like gene, dna, and genetic often, and words like data, number, and computer rarely
When speaking about computation, you will use words like data, number, and computer often, and words like gene, dna, and genetic rarely
| Topic | gene | dna | genetic | data | number | computer |
|---|---|---|---|---|---|---|
| Genetics (probably) | 0.4 | 0.25 | 0.3 | 0.02 | 0.02 | 0.01 |
| Computation (probably) | 0.02 | 0.01 | 0.02 | 0.3 | 0.4 | 0.25 |
Note that no word has probability of exactly 0 under either topic.
In a topic model, each document is described as being composed of a mixture of corpus-wide topics
For each document, we find the topic proportions that maximize the probability that we would observe the words in that particular document
Imagine we have two documents with the following word counts
| Topic | gene | dna | genetic | data | number | computer |
|---|---|---|---|---|---|---|
| Genetics | 0.4 | 0.25 | 0.3 | 0.02 | 0.02 | 0.01 |
| Computation | 0.02 | 0.01 | 0.02 | 0.3 | 0.4 | 0.25 |
Let’s calculate the probability that each document would be drawn from each distribution/model, \(P(W_i|\mu)\)
What is the probability of observing Document A’s word counts under the “Genetics” topic?
\[\begin{eqnarray} P(W_A|\mu_{\text{Genetics}}) &=& \frac{M_A!}{\prod_{j=1}^JW_{A,j}!}\prod_{j=1}^J\mu_{\text{Genetics},j}^{W_{A,j}} \\ &=& 0.0000000798336 \end{eqnarray}\]
What is the probability of observing Document A’s word counts under the “Computation” topic?
\[\begin{eqnarray} P(W_A|\mu_{\text{Computation}}) &=& \frac{M_A!}{\prod_{j=1}^JW_{A,j}!}\prod_{j=1}^J\mu_{\text{Computation},j}^{W_{A,j}} \\ &=& 0.0000000287401 \end{eqnarray}\]
What is the probability of observing Document A’s word counts under an equal mixture of the two topics, \(\mu_{\text{Comp+Genet}} = 0.5\,\mu_{\text{Computation}} + 0.5\,\mu_{\text{Genetics}}\)?
Note that here we do not calculate the probabilities separately for each topic: we first mix the two word distributions, then evaluate the document under the mixture.
\[\begin{eqnarray} P(W_A|\mu_{\text{Comp+Genet}}) &=& \frac{M_A!}{\prod_{j=1}^JW_{A,j}!}\prod_{j=1}^J\mu_{\text{Comp+Genet},j}^{W_{A,j}} \\ &=& 0.001210891 \end{eqnarray}\]
What is the probability of observing Document B’s word counts under the “Genetics” topic?
\[\begin{eqnarray} P(W_B|\mu_{\text{Genetics}}) &=& \frac{M_B!}{\prod_{j=1}^JW_{B,j}!}\prod_{j=1}^J\mu_{\text{Genetics},j}^{W_{B,j}} \\ &=& 0.0000112266 \end{eqnarray}\]
What is the probability of observing Document B’s word counts under the “Computation” topic?
\[\begin{eqnarray} P(W_B|\mu_{\text{Computation}}) &=& \frac{M_B!}{\prod_{j=1}^JW_{B,j}!}\prod_{j=1}^J\mu_{\text{Computation},j}^{W_{B,j}} \\ &=& 0.00000000004790016 \end{eqnarray}\]
What is the probability of observing Document B’s word counts under an equal mixture of the two topics?
\[\begin{eqnarray} P(W_B|\mu_{\text{Comp+Genet}}) &=& \frac{M_B!}{\prod_{j=1}^JW_{B,j}!}\prod_{j=1}^J\mu_{\text{Comp+Genet},j}^{W_{B,j}} \\ &=& 0.0007378866 \end{eqnarray}\]
What is the probability of observing Document B’s word counts under a 60:40 mixture of the two topics?
\[\begin{eqnarray} P(W_B|\mu_{\text{Comp+Genet}}) &=& \frac{M_B!}{\prod_{j=1}^JW_{B,j}!}\prod_{j=1}^J\mu_{\text{Comp+Genet},j}^{W_{B,j}} \\ &=& 0.001262625 \end{eqnarray}\]
Implication: Our documents may be better described in terms of mixtures of different topics than by one topic alone.
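These calculations are easy to reproduce. Below is a minimal R sketch using base R's dmultinom(); note that the word counts are hypothetical, since the slides give only the topic distributions:

# Topic-word distributions from the table above
mu_genetics    <- c(gene = 0.40, dna = 0.25, genetic = 0.30,
                    data = 0.02, number = 0.02, computer = 0.01)
mu_computation <- c(gene = 0.02, dna = 0.01, genetic = 0.02,
                    data = 0.30, number = 0.40, computer = 0.25)

# Hypothetical word counts for a document (for illustration only)
w_A <- c(gene = 4, dna = 2, genetic = 3, data = 2, number = 1, computer = 0)

# Probability of the counts under each single topic
dmultinom(w_A, prob = mu_genetics)
dmultinom(w_A, prob = mu_computation)

# Probability under a 50:50 mixture: mix the word distributions first
mu_mix <- 0.5 * mu_genetics + 0.5 * mu_computation
dmultinom(w_A, prob = mu_mix)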
A topic model simultaneously estimates two sets of probabilities
The probability of observing each word for each topic
| Topic | gene | dna | genetic | data | number | computer |
|---|---|---|---|---|---|---|
| Genetics | 0.4 | 0.25 | 0.3 | 0.02 | 0.02 | 0.01 |
| Computation | 0.02 | 0.01 | 0.02 | 0.3 | 0.4 | 0.25 |
The probability of observing each topic in each document
| Document | Genetics | Computation |
|---|---|---|
| A | 0.70 | 0.30 |
| B | 0.35 | 0.65 |
These quantities can then be used to organise documents by topic, assess how topics vary across documents, etc.
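As an illustrative sketch (assuming theta is the D × K matrix of document-topic proportions from a fitted model), ranking documents by a topic takes one line:

# Indices of the ten documents that devote the largest share of words to topic 1
head(order(theta[, 1], decreasing = TRUE), n = 10)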
LDA is a probabilistic language model.
LDA assumes: each topic is a distribution over words (\(\beta_k\)); each document is a mixture of corpus-wide topics (\(\theta_d\)); and each word in each document is drawn from one of those topics.
However, we only observe documents!
The goal of LDA is to estimate hidden parameters (\(\color{blue} \beta\) and \(\color{blue} \theta\)) starting from observed words \(\color{blue} w\).
What does “Dirichlet” mean here?
Suppose we have 3 topics (Politics, Sports, Tech).
Each document needs a recipe: how much of each topic to use. These proportions must sum to 1, e.g.:
- (0.80, 0.10, 0.10) - document is mostly about one topic
- (0.33, 0.33, 0.34) - document evenly mixes topics
Each of these vectors is a probability distribution over topics, and there are infinitely many such vectors.
A Dirichlet distribution answers: How likely are these different probability distributions?
We set expectations before we see any documents.
Example:
- If I expect focused documents: Dirichlet(0.1, 0.1, 0.1) makes sparse, single-topic mixtures likely, so (0.80, 0.10, 0.10) is the more probable recipe
- If I expect mixed documents: Dirichlet(10, 10, 10) makes balanced, multi-topic mixtures likely, so (0.33, 0.33, 0.34) is the more probable recipe
So: Dirichlet = distribution over probability distributions | Multinomial = draws outcomes given probabilities
The multinomial distribution is a probability distribution describing the results of a random variable that can take on one of K possible categories
The multinomial distribution depicted has probabilities \([0.2, 0.7, 0.1]\)
A draw (of size one) from a multinomial distribution returns one of the categories of the distribution
A draw of a larger size from a multinomial distribution returns several categories of the distribution in proportion to their probabilities
We have seen this before! Naive Bayes uses the multinomial distribution to describe the probability of observing words in different categories of documents.
Step 1: We have Football’s word distribution
goal: 30% ████████
team: 25% ██████
player: 20% █████
score: 15% ████
computer: 1% █
gene: 1% █
Step 2: Draw 10 words from this distribution
Each draw is independent:
Word 1 → goal | Word 2 → team | Word 3 → goal | Word 4 → player | Word 5 → score
Word 6 → goal | Word 7 → team | Word 8 → player | Word 9 → goal | Word 10 → team
Step 3: Resulting document
“goal team goal player score goal team player goal team”
Notice: High-probability words (goal, team, player) appear more frequently!
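A minimal R sketch of this sampling step; the probabilities mirror the bars above, with the remaining mass lumped into an assumed "other" category so that they sum to one:

vocab <- c("goal", "team", "player", "score", "computer", "gene", "other")
probs <- c(0.30, 0.25, 0.20, 0.15, 0.01, 0.01, 0.08)

set.seed(1)
# Draw 10 words independently from the Football topic
doc <- sample(vocab, size = 10, replace = TRUE, prob = probs)
paste(doc, collapse = " ")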
What Dirichlet does:
The Dirichlet distribution is a distribution over the simplex, i.e., positive vectors that sum to one
A draw from a Dirichlet distribution returns a vector of positive numbers that sum to one
In other words, we can think of draws from a Dirichlet distribution as being themselves multinomial distributions
Parameter \(\alpha\) controls sparsity: values below 1 favour sparse mixtures (most mass on a few topics), while larger values favour balanced mixtures.
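A small sketch of this in base R, using the standard construction of a Dirichlet draw as normalised Gamma draws:

rdirichlet_one <- function(alpha) {
  # A Dirichlet draw: independent Gamma draws, normalised to sum to one
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

set.seed(1)
round(rdirichlet_one(c(0.1, 0.1, 0.1)), 2)  # typically sparse: one topic dominates
round(rdirichlet_one(c(10, 10, 10)), 2)     # typically balanced across topics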
How Dirichlet works in LDA: The Learning Process
Dirichlet’s role (α=0.1):
- Prior expectation: “Documents should be sparse (focus on few topics)”
- Doesn’t know content - just shapes how documents mix topics
During LDA Inference:
- LDA notices topic co-occurrences in docs
- Combines: Dirichlet’s sparsity preference + observed co-occurrences
- Output: Documents with focused topic mixture
After LDA:
- Doc 1: topic1(0.30), topic2(0.25), topic3(0.20)…
- Humans interpret
Example draws: p = [0.21, 0.50, 0.28]; p = [0.22, 0.60, 0.18]; p = [0.37, 0.20, 0.43] — each draw is itself a probability distribution over the three topics.
LDA assumes a generative process for documents:
Create topics: each topic is a probability distribution \(\beta_{k}\) over words
Create the document’s topic mixture: for each document, draw a probability distribution \(\theta_{d}\) over topics
Generate each word: for each word in each document,
Draw one of the \(K\) topics, \(z_i\), from the document’s distribution over topics \(\theta_d\)
Given \(z_i\), draw one of the \(J\) words in the vocabulary from that topic’s distribution over words \(\beta_{z_i}\)
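The whole generative process can be simulated in a few lines of R, reusing the rdirichlet_one() helper sketched above (the vocabulary, K, and hyperparameters are illustrative choices):

set.seed(2)
vocab <- c("gene", "dna", "genetic", "data", "number", "computer")
K <- 2; D <- 3; n_words <- 20

# Step 1: create topics, each a distribution over the vocabulary
beta <- t(replicate(K, rdirichlet_one(rep(0.5, length(vocab)))))

# Step 2: create each document's distribution over topics
theta <- t(replicate(D, rdirichlet_one(rep(0.1, K))))

# Step 3: for each word, draw a topic z, then a word from beta[z, ]
docs <- lapply(1:D, function(d) {
  z <- sample(1:K, size = n_words, replace = TRUE, prob = theta[d, ])
  sapply(z, function(k) sample(vocab, size = 1, prob = beta[k, ]))
})
docs[[1]]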
This space shows all possible ways to mix words from our vocabulary
With K = 3 topics, each topic is a point in the WORD simplex. Any point inside the triangle spanned by the three topics is some mixture of them
Now we generate documents. Each document is a point in the TOPIC simplex:
(Figure: documents plotted as points in the topic simplex.)
Note: \(\eta\) and \(\alpha\) govern the sparsity of the draws from the Dirichlet. As they \(\rightarrow 0\), the multinomials become more sparse.
From a collection of documents, we infer the per-topic word distributions \(\beta_k\) and the per-document topic proportions \(\theta_d\)
Then use estimates of these parameters to perform the task at hand \(\rightarrow\) information retrieval, document similarity, exploration, and others.
Assuming the documents have been generated in this way in turn makes it possible to back out the shares of topics within documents and the shares of words within topics
Estimation of the LDA model is done in a Bayesian framework
Our \(Dir(\alpha)\) and \(Dir(\eta)\) are the prior distributions of the \(\theta_d\) and \(\beta_k\)
We use Bayes’ rule to update these prior distributions to obtain a posterior distribution for each \(\theta_d\) and \(\beta_k\)
The means of these posterior distributions are the outputs reported by statistical packages, and they are what we use to investigate the \(\theta_d\) and \(\beta_k\)
Estimation is performed using either collapsed Gibbs sampling or variational methods
Fortunately, for us these are easily implemented in R
LDA trades off two goals:
1. For each document, allocate its words to as few topics as possible.
2. For each topic, assign high probability to as few terms as possible.
These goals are at odds.
Putting a document in a single topic makes (2) hard: All of its words must have probability under that topic.
Putting very few words in each topic makes (1) hard: To cover a document’s words, it must assign many topics to it.
Trading off these goals finds groups of tightly co-occurring words
Imagine we have \(D = 1000\) documents, \(J = 10,000\) words, and \(K = 3\) topics.
The key outputs of the topic model are the \(\beta\) and \(\theta\) matrices:
\[\begin{equation} \theta = \underbrace{\begin{pmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3}\\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3}\\ ... & ... & ...\\ \theta_{D,1} & \theta_{D,2} & \theta_{D,3}\\ \end{pmatrix}}_{D\times K} = \underbrace{\begin{pmatrix} 0.7 & 0.2 & 0.1\\ 0.1 & 0.8 & 0.1\\ ... & ... & ...\\ 0.3 & 0.3 & 0.4\\ \end{pmatrix}}_{1000 \times 3} \end{equation}\]
\[\begin{equation} \beta = \underbrace{\begin{pmatrix} \beta_{1,1} & \beta_{1,2} & ... & \beta_{1,J}\\ \beta_{2,1} & \beta_{2,2} & ... & \beta_{2,J}\\ \beta_{3,1} & \beta_{3,2} & ... & \beta_{3,J}\\ \end{pmatrix}}_{K\times J} = \underbrace{\begin{pmatrix} 0.04 & 0.0001 & ... & 0.003\\ 0.0004 & 0.001 & ... & 0.00005\\ 0.002 & 0.0003 & ... & 0.0008\\ \end{pmatrix}}_{3 \times 10,000} \end{equation}\]
\(\theta\) is D × K
- Each row = one document’s topic mixture
\(\beta\) is K × J
- Each row = one topic’s word distribution
Data: UK House of Commons’ debates (PMQs)
Rows: 27,885
Columns: 4
$ name <chr> "Ian Bruce", "Tony Blair", "Denis MacShane", "Tony Blair"…
$ party <chr> "Conservative", "Labour", "Labour", "Labour", "Liberal De…
$ constituency <chr> "South Dorset", "Sedgefield", "Rotherham", "Sedgefield", …
$ body <chr> "In a written answer, the Treasury has just it made clear…
Estimating the model using the topicmodels package:
library(quanteda)
library(topicmodels)
## Create corpus
pmq_corpus <- pmq %>%
corpus(text_field = "body")
pmq_dfm <- pmq_corpus %>%
tokens(remove_punct = TRUE) %>%
dfm() %>%
dfm_remove(stopwords("en")) %>%
dfm_wordstem() %>%
dfm_trim(min_termfreq = 5)
## Convert for usage in 'topicmodels' package
pmq_tm_dfm <- pmq_dfm %>%
  convert(to = 'topicmodels')
\[ \text{term-score}_{k,v} = \hat{\beta}_{k,v}\log\left(\frac{\hat{\beta}_{k,v}}{(\prod_{j=1}^{K}\hat{\beta}_{j,v})^{\frac{1}{K}}}\right) \]
This score ranks words that are common within a topic but rare across other topics. Ranking words by raw probability instead tends to surface generic terms:
• “the”, “and”, “people”, “said”
• words that appear in every topic
The term score instead asks: is this word distinctive for this topic? Its formulation is akin to the TF-IDF term score.
library(tidytext)  # provides tidy() methods for LDA objects
library(dplyr)

# Extract estimated topic-word probabilities (beta)
topics <- tidy(ldaOut, matrix = "beta")

# Compute term scores to identify distinctive words per topic
# (implements the equation on the previous slide)
top_terms <- topics %>%
  # group by word (term)
  group_by(term) %>%
  # geometric mean of beta across the 20 topics:
  # this downweights words that are common across many topics
  mutate(beta_k = prod(beta)^(1/20)) %>%
  # remove grouping by term
  ungroup() %>%
  # compute the term score
  mutate(term_score = beta * log(beta / beta_k)) %>%
  # group by topic: we now want the most distinctive words PER topic
  group_by(topic) %>%
  # keep the top 10 highest-scoring words per topic
  slice_max(term_score, n = 10)
# Extract the terms with the largest scores per topic
top_terms$term[top_terms$topic == 3]

 [1] "entent"    "jospin"    "banff"
 [4] "buchan"    "courtesi"  "till"
 [7] "guillotin" "59"        "greg"
[10] "abhor"     "impetus"   "wive"
[... output truncated ...]
[13] "beforehand" "advert" "walton"
[16] "unsuccess" "nine-year-old" "opencast"
[19] "3.30" "dotti" "intrud"
[22] "infer" "favourit" "reconfirm"
[25] "unfail" "courteous" "knock-on"
[28] "ruthless" "doubli" "pact"
[31] "inclin" "dilapid" "unansw"
[34] "mayhem" "10,500" "scotsman"
[37] "ira-sinn" "22,000" "novel"
[40] "imparti" "perri" "impend"
[43] "ftse" "cream" "verg"
[46] "underwrit" "brunt" "inconceiv"
[49] "overh" "reign" "bias"
[52] "ill-thought-out" "patronis" "symptom"
[55] "hyde" "misunderstood" "102"
[58] "1988" "11-plus" "even-hand"
[61] "intransig" "obscur" "likewis"
[64] "summaris" "sabotag" "uppermost"
[67] "coloni" "swan" "california"
[70] "runner" "repudi" "measl"
[73] "immunis" "delic" "reintroduct"
[76] "spongiform" "encephalopathi" "dust"
[79] "81" "imped" "centrepiec"
[82] "unveil" "nestl" "maff"
[85] "hygien" "deed" "post-16"
[88] "non-exist" "mickey" "mous"
[91] "grant-maintain" "postgradu" "unten"
[94] "indonesia" "hindsight" "jersey"
[97] "breathtak" "foul" "slur"
[100] "proprieti" "leaf" "duall"
[103] "al" "coloss" "sewerag"
[106] "115" "logjam" "roy"
[109] "steep" "howarth" "thereaft"
[112] "authorit" "well-deserv" "intrigu"
[115] "harden" "anti-drug" "presumptu"
[118] "strasbourg" "crackdown" "salisburi"
[121] "anti-personnel" "ottawa" "ratepay"
[124] "oslo" "fist" "perpetu"
[127] "winchest" "enact" "dam"
[130] "unresolv" "prodi" "subsidiar"
[133] "whitbi" "angl" "strangl"
[136] "boot" "450,000" "hoard"
[139] "merri" "shabbi" "luxembourg"
[142] "unconvinc" "1972" "newark"
[145] "out-of-school" "berni" "skin"
[148] "hair" "silli" "one-n"
[151] "goldsmith" "mishandl" "first-past-the-post"
[154] "rjb" "budg" "chronicl"
[157] "tessa" "pep" "reliant"
[160] "moorland" "soil" "withheld"
[163] "distast" "cake" "190"
[166] "guard" "scorn" "deris"
[169] "adam" "semtex" "non-viol"
[172] "downsiz" "shower" "vitamin"
[175] "nanni" "exhort" "fianc"
[178] "honeymoon" "conform" "bleak"
[181] "improp" "fierc" "rival"
[184] "flu" "270,000" "rigour"
[187] "neutral" "ampl" "spurious"
[190] "philosophi" "worst-off" "beveridg"
[193] "contributori" "affluent" "unnam"
[196] "caller" "v" "underachiev"
[199] "650,000" "crude" "bland"
[202] "fanfar" "misinterpret" "marathon"
[205] "twenty-six" "ill-found" "gillingham"
[208] "richer" "undercut" "bawl"
[211] "lenient" "adventur" "lest"
[214] "cardin" "back-door" "re-emphasis"
[217] "censorship" "coe" "owner-occupi"
[220] "brown-field" "proprietor" "tin"
[223] "affili" "tebbit" "mourn"
[226] "unduli" "ice" "senseless"
[229] "newri" "derail" "vice-chancellor"
[232] "tranquil" "herd" "new-found"
[235] "tertiari" "trifl" "unsubsidis"
[238] "telecommun" "1945" "magnet"
[241] "concili" "severest" "mixtur"
[244] "1981" "editori" "sober"
[247] "8.5" "shire" "sean"
[250] "communist" "girlfriend" "directorship"
[253] "out-of-touch" "perish" "3,800"
[256] "eccl" "descent" "double-digit"
[259] "portray" "high-profil" "creed"
[262] "tram" "hay" "chef"
[265] "resound" "ssa" "#8211;98"
[268] "ilford" "netanyahu" "arafat"
[271] "trimbl" "arthur" "precept"
[274] "unforeseen" "desist" "linear"
[277] "intermedi" "reactor" "founder"
[280] "disarray" "famin" "philosoph"
[283] "heinous" "besid" "winterton"
[286] "rutland" "avon" "freak"
[289] "covert" "contravent" "playgroup"
[292] "2nd" "battalion" "regiment"
[295] "infantri" "aberdeenshir" "unicef"
[298] "drought" "nutrit" "shatter"
[301] "spectacl" "dad" "captain"
[304] "sub-contin" "heroic" "coup"
[307] "sandlin" "thoma" "hms"
[310] "selfless" "squadron" "reintegr"
[313] "communiqu" "strenuous" "refrain"
[316] "18-month" "juggl" "86,000"
[319] "forgiven" "duty-fre" "attle"
[322] "legg" "warmest" "nearbi"
[325] "endow" "lamont" "defici"
[328] "renfrewshir" "chesham" "amersham"
[331] "1973" "discrep" "mcguin"
[334] "weaponri" "commenc" "in-built"
[337] "mindless" "ethiopia" "greet"
[340] "hard-hit" "luxuri" "compuls"
[343] "culmin" "hurdl" "gill"
[346] "dishonest" "license" "murphi"
[349] "hat" "luca" "foresight"
[352] "barrist" "repetit" "prestwick"
[355] "two-week" "scan" "carr"
[358] "jon" "flatter" "grimethorp"
[361] "7.5" "rpi" "this-that"
[364] "33,000" "absolv" "unspecifi"
[367] "reprimand" "gag" "ordnanc"
[370] "kuwait" "fend" "fraught"
[373] "backyard" "grang" "orang"
[376] "leas" "award-win" "bevan"
[379] "#8211;99" "southend-on-sea" "preferenti"
[382] "liddl" "cocktail" "emphat"
[385] "disprov" "chunter" "cadet"
[388] "continent" "wellington" "redeem"
[391] "danc" "0" "maud"
[394] "chest" "priest" "cross-department"
[397] "contend" "3,300" "restrain"
[400] "preoccupi" "intak" "brush"
[403] "ernst" "pretenc" "asda"
[406] "intact" "accomplish" "vagu"
[409] "griffith" "joseph" "motabl"
[412] "paddi" "gerrymand" "nspcc"
[415] "journal" "blatter" "pinochet"
[418] "crass" "chile" "saniti"
[421] "bless" "pin" "lancast"
[424] "advent" "sand" "armistic"
[427] "mathemat" "wye" "uncheck"
[430] "swiss" "newcastle-under-lym" "pre-prepar"
[433] "undecid" "unrealist" "newhaven"
[436] "jackson" "skew" "temptat"
[439] "ex-min" "asbestosi" "slide"
[442] "thrash" "gratuit" "hammer"
[445] "portillo" "wrap" "belov"
[448] "bat" "monstros" "variant"
[451] "winston" "disown" "world-beat"
[454] "dublin" "complimentari" "peugeot"
[457] "talbot" "bonn" "resembl"
[460] "sampl" "recip" "lafontain"
[463] "crook" "hard-won" "under-18"
[466] "demograph" "trophi" "inflam"
[469] "flush" "0.8" "grind"
[472] "hard-earn" "upfront" "pantomim"
[475] "feud" "beleagu" "insensit"
[478] "inher" "anti-terrorist" "jean"
[481] "mid-1990" "maladministr" "earthquak"
[484] "rubbl" "inquir" "325"
[487] "imperfect" "barn" "luci"
[490] "agonis" "respiratori" "65,000"
[493] "work-rel" "egg" "thirti"
[496] "18th" "112" "rice"
[499] "peace-lov" "bully-boy" "taint"
[502] "ramp" "mening" "cavali"
[505] "bug" "child-car" "manor"
[508] "lynn" "ludlow" "marin"
[511] "assessor" "104" "macpherson"
[514] "burnt" "omagh" "abhorr"
[517] "lanc" "arsenal" "artilleri"
[520] "breweri" "nicholson" "cousin"
[523] "sail" "dive" "oscar"
[526] "genius" "sub" "eurobond"
[529] "conscious" "miracl" "sandwel"
[532] "sphere" "physiotherapist" "ground-break"
[535] "omiss" "grass" "bangladeshi"
[538] "planner" "chess" "kosovan"
[541] "indiscrimin" "flagrant" "specialti"
[544] "slap" "rear" "heartfelt"
[547] "subsidiari" "46,000" "rationalis"
[550] "13th" "posh" "wrench"
[553] "guthri" "snow" "old-ag"
[556] "silver" "diversifi" "militarili"
[559] "72,000" "gloom" "paulin"
[562] "nurtur" "telford" "wrekin"
[565] "toilet" "undertook" "parrot"
[568] "alexandra" "ill-conceiv" "groundbreak"
[571] "beliz" "schr" "#246"
[574] "der" "m4" "third-world"
[577] "plc" "energy-intens" "mid-essex"
[580] "lothian" "malnutrit" "downey"
[583] "thabo" "nelson" "mandela"
[586] "poultri" "verifi" "sequenc"
[589] "no-on" "gallant" "regi"
[592] "loot" "tangibl" "requisit"
[595] "mencap" "wythenshaw" "garbag"
[598] "79" "34,000" "payabl"
[601] "g" "indict" "62,000"
[604] "2,200" "roger" "junctur"
[607] "steward" "early-year" "tuberculosi"
[610] "depreci" "default" "bowl"
[613] "haslar" "sigh" "rip-off"
[616] "1.9" "kempston" "89"
[619] "sutherland" "breadth" "tripartit"
[622] "apprehens" "summon" "streamlin"
[625] "thatcherit" "pork" "lamb"
[628] "ryedal" "320" "nepal"
[631] "gratuiti" "sergeant" "self-def"
[634] "pier" "pembrokeshir" "torbay"
[637] "cyclon" "cave" "leven"
[640] "jeff" "discriminatori" "hectar"
[643] "greenbelt" "whisper" "entrepreneurship"
[646] "slice" "performance-rel" "jeffrey"
[649] "archer" "analyst" "wilberforc"
[652] "knaresborough" "proactiv" "sinist"
[655] "chelsea" "deem" "compris"
[658] "cheat" "spotlight" "indefens"
[661] "illiter" "woodspr" "non-urg"
[664] "compound" "rowntre" "fabul"
[667] "mentor" "84" "175,000"
[670] "ravag" "undo" "untold"
[673] "revolutionis" "shiver" "saatchi"
[676] "bogl" "2,600" "starter"
[679] "midwiferi" "resurg" "dishonour"
[682] "casual" "87" "110,000"
[685] "homosexu" "henley" "ash"
[688] "pamphlet" "taxi" "sarwar"
[691] "36,000" "23,000" "slough"
[694] "unknown" "malaysia" "pro-european"
[697] "refit" "erod" "colvin"
[700] "cashpoint" "nichola" "outpati"
[703] "how" "spearhead" "69"
[706] "raft" "pancra" "amput"
[709] "hook" "maker" "4.7"
[712] "ppp" "101" "processor"
[715] "4,200" "threw" "beggar"
[718] "take-hom" "tighter" "catherin"
[721] "sittingbourn" "sheppey" "acquaint"
[724] "dialysi" "lichfield" "slim"
[727] "rumbl" "smug" "deduct"
[730] "photo" "orchestr" "bombshel"
[733] "philip" "evesham" "webb"
[736] "tabloid" "dawn" "acclaim"
[739] "lossiemouth" "patten" "kurdish"
[742] "devis" "austria" "114"
[745] "pusher" "cultiv" "fascist"
[748] "heel" "stricter" "avenu"
[751] "second-class" "xenophobia" "cenotaph"
[754] "wretch" "peel" "doubtless"
[757] "crackpot" "unquestion" "dagenham"
[760] "uptak" "diet" "appledor"
[763] "revolutionari" "chop" "carp"
[766] "knee-jerk" "rattl" "3.2"
[769] "gla" "minework" "debacl"
[772] "check-up" "m6" "heysham"
[775] "fairfield" "surest" "branson"
[778] "exorbit" "drunken" "metaphor"
[781] "fuller" "disquiet" "longev"
[784] "traumat" "geograph" "outer"
[787] "penc" "gallon" "eye-catch"
[790] "alistair" "scott" "well-run"
[793] "whing" "state-own" "misgiv"
[796] "lag" "splendid" "scotch"
[799] "adrift" "knowsley" "thwart"
[802] "flew" "dewar" "credenti"
[805] "standstil" "geographi" "cleveland"
[808] "part-privatis" "destitut" "twickenham"
[811] "blockag" "cjd" "impair"
[814] "gmtv" "sore" "landmark"
[817] "non-elect" "case-by-cas" "fortitud"
[820] "korean" "replica" "mock"
[823] "tendenc" "660" "super-st"
[826] "singular" "mortar" "solvent"
[829] "mate" "etern" "kidney"
[832] "3.4" "kintyr" "gape"
[835] "revolt" "19th" "pothol"
[838] "float" "unpredict" "78"
[841] "non-clin" "phillip" "santa"
[844] "sell-off" "jason" "jeanett"
[847] "medallist" "tanni" "amber"
[850] "frequenc" "sieg" "be"
[853] "mitcham" "hastilow" "testament"
[856] "neath" "hain" "mischief"
[859] "anna" "unimagin" "hillingdon"
[862] "lever" "prosser" "wet"
[865] "overtaken" "novic" "qc"
[868] "bolster" "beet" "commod"
[871] "capitalis" "rifl" "alder"
[874] "hey" "lump" "offhand"
[877] "sub-saharan" "dine" "dishonesti"
[880] "hinduja" "46" "clever"
[883] "indign" "bureaux" "agrimonetari"
[886] "disinfect" "suez" "decor"
[889] "12-year-old" "groom" "entic"
[892] "broxtow" "lit" "apprehend"
[895] "bravest" "influenti" "opt-in"
[898] "gilt" "ewe" "slaughtermen"
[901] "valuat" "squabbl" "lab"
[904] "carcas" "118" "decision-mak"
[907] "sinew" "brigadi" "reservoir"
[910] "counter-product" "undervalu" "91"
[913] "song" "consequenti" "christoph"
[916] "arc" "largest-ev" "eleventh"
[919] "cod" "brad" "dispens"
[922] "wakeham" "manslaught" "sandwich"
[925] "funder" "eden" "hideous"
[928] "coln" "bursari" "normalis"
[931] "anti-ballist" "radar" "apprais"
[934] "tidi" "stifl" "nantwich"
[937] "pop" "humour" "virtu"
[940] "commiser" "hendon" "irrevers"
[943] "forbidden" "jami" "satellit"
[946] "power-shar" "unravel" "belat"
[949] "succour" "lake" "cornish"
[952] "rosi" "strand" "open-end"
[955] "uneas" "explod" "pleasant"
[958] "tobin" "penros" "mercia"
[961] "fivefold" "brian" "11th"
[964] "haemophilia" "recombin" "byer"
[967] "stand-bi" "euroscept" "48-hour"
[970] "astound" "tewkesburi" "robertson"
[973] "lion" "consul" "63"
[976] "leukaemia" "irrit" "twofold"
[979] "movi" "overhead" "outlook"
[982] "whipp" "de-escal" "crush"
[985] "follow-up" "mmr" "jab"
[988] "13-year-old" "rmt" "hamper"
[991] "ivan" "renationalis" "mexico"
[994] "norm" "unsurpris" "tokyo"
[997] "three-star" "parachut" "portland"
[1000] "lesser" "quinn" "fridg"
[1003] "incept" "harrison" "christin"
[1006] "unrel" "londonderri" "jamaica"
[1009] "clot" "swallow" "knot"
[1012] "striker" "baton" "relay"
[1015] "wellcom" "romanian" "mittal"
[1018] "ronald" "berlusconi" "above-infl"
[1021] "northwick" "torn" "non-uk"
[1024] "private-sector" "weston" "coffe"
[1027] "invok" "inadvert" "busiest"
[1030] "ill-judg" "unreform" "budd"
[1033] "picker" "texa" "rejoic"
[1036] "greenpeac" "slower" "sandi"
[1039] "ramallah" "caf" "heartless"
[1042] "relentless" "strictest" "half-bak"
[1045] "prosecutor" "recreat" "cuban"
[1048] "out-of-control" "elit" "corp"
[1051] "wealthier" "detract" "osborn"
[1054] "minder" "linda" "cymru"
[1057] "piraci" "copyright" "50s"
[1060] "likelihood" "seal" "blairit"
[1063] "lew" "implicit" "turnbul"
[1066] "dragoon" "sevill" "paltri"
[1069] "inch" "pornographi" "marshal"
[1072] "purs" "day-cas" "nightingal"
[1075] "clatterbridg" "47,000" "day-to-day"
[1078] "paramilitar" "inhabit" "intrus"
[1081] "notori" "itv" "furious"
[1084] "bullet" "reclassif" "detector"
[1087] "pope" "lea" "dastard"
[1090] "post-" "har" "ron"
[1093] "disreput" "pernici" "1987"
[1096] "simplest" "fractur" "cathedr"
[1099] "boldest" "applianc" "forcibl"
[1102] "kenyan" "munit" "matthew"
[1105] "76" "acquisit" "redirect"
[1108] "grubbi" "salvat" "bourn"
[1111] "analog" "far-flung" "counten"
[1114] "golf" "88" "stanst"
[1117] "outreach" "seventh" "injunct"
[1120] "landfil" "froze" "chan"
[1123] "lad" "variabl" "draconian"
[1126] "canvass" "short-chang" "government"
[1129] "smash" "cps" "shakespear"
[1132] "sophi" "shaw" "elabor"
[1135] "drivel" "incredul" "complement"
[1138] "turnaround" "ali" "echr"
[1141] "sharon" "sway" "stubborn"
[1144] "across-the-board" "disburs" "probe"
[1147] "disdain" "plate" "daventri"
[1150] "ransom" "hoyl" "patch"
[1153] "clwyd" "commando" "median"
[1156] "twelv" "770" "cds"
[1159] "disrespect" "unman" "homeland"
[1162] "legendari" "setback" "tinker"
[1165] "propens" "melt" "post-saddam"
[1168] "theresa" "karl" "demean"
[1171] "oil-for-food" "overthrow" "wari"
[1174] "trooper" "fertil" "peripher"
[1177] "abba" "henchmen" "reconstitut"
[1180] "fiona" "waltham" "92"
[1183] "pitt" "lyneham" "hercul"
[1186] "15-year-old" "dale" "celtic"
[1189] "vietnam" "jic" "behest"
[1192] "rebut" "tidying-up" "dunstabl"
[1195] "eddi" "potteri" "garrison"
[1198] "trader" "dfid" "unregist"
[1201] "oxfam" "hallam" "bicycl"
[1204] "alstom" "broughton" "filton"
[1207] "murrison" "galloway" "delta"
[1210] "preval" "niger" "nobbl"
[1213] "yvonn" "drink-driv" "fusili"
[1216] "booklet" "prestigi" "cocain"
[1219] "lister" "workload" "claw"
[1222] "uncost" "graffiti" "180,000"
[1225] "dave" "khabra" "tail"
[1228] "r" "deepcut" "conduc"
[1231] "privi" "lance-corpor" "67,000"
[1234] "log" "southport" "2.6"
[1237] "diageo" "elev" "cede"
[1240] "flown" "walter" "fifti"
[1243] "kit" "blackmail" "sing"
[1246] "55,000" "thai" "wildlif"
[1249] "280" "criminalis" "preliminari"
[1252] "statu" "under-16" "rational"
[1255] "ham" "campbeltown" "courier"
[1258] "machrihanish" "top-perform" "kerri"
[1261] "stain" "ivf" "37,000"
[1264] "somali" "ilo" "three-month"
[1267] "age-rel" "macular" "degener"
[1270] "rnib" "moon" "coffer"
[1273] "loughborough" "unqualifi" "wycomb"
[1276] "25th" "objection" "renounc"
[1279] "honorari" "cyclist" "horton"
[1282] "radcliff" "29,000" "nake"
[1285] "bulmer" "primaci" "imc"
[1288] "statin" "fallujah" "disqualifi"
[1291] "milburn" "falluja" "doom"
[1294] "malcolm" "outdat" "inadequaci"
[1297] "gratia" "indigen" "reagan"
[1300] "normandi" "shortest" "boston"
[1303] "disingenu" "redbridg" "amicus"
[1306] "patron" "loosen" "diego"
[1309] "garcia" "quartet" "email"
[1312] "163" "sarah" "beslan"
[1315] "duke" "adair" "leamington"
[1318] "asbo" "rung" "mccoll"
[1321] "resil" "practice-bas" "drawn-out"
[1324] "shoplift" "eyesight" "flanagan"
[1327] "yarl" "plastic" "csa"
[1330] "footprint" "chew" "gum"
[1333] "bullingdon" "unauthoris" "barrow"
[1336] "valiant" "sach" "self-suffici"
[1339] "minim" "misbehav" "classifi"
[1342] "neston" "coastlin" "authoritarian"
[1345] "montgomeryshir" "dixon" "snatch"
[1348] "caravan" "ranger" "bermondsey"
[1351] "unblock" "mire" "mallon"
[1354] "jcb" "mosqu" "heywood"
[1357] "a64" "acpo" "four-squar"
[1360] "mali" "cognit" "thirsk"
[1363] "overpay" "overpaid" "swede"
[1366] "soni" "high-skil" "stand-alon"
[1369] "protectionist" "singapor" "grandmoth"
[1372] "vaz" "glorifi" "tel"
[1375] "aviv" "madrassah" "hindu"
[1378] "payment-by-result" "changemak" "herceptin"
[1381] "goalpost" "4th" "90-day"
[1384] "well-fund" "31,000" "childlin"
[1387] "tibetan" "prostitut" "brightsid"
[1390] "mileston" "learner" "hayman"
[1393] "66,000" "carlil" "ilkeston"
[1396] "calendar" "chant" "regionalis"
[1399] "roadsid" "gibson" "basingstok"
[1402] "homicid" "nye" "inexplic"
[1405] "bulgarian" "healthier" "taunton"
[1408] "dunbartonshir" "dpp" "non-agricultur"
[1411] "unintend" "rachel" "plural"
[1414] "antiretrovir" "co-chair" "underperform"
[1417] "au" "howel" "humbl"
[1420] "gold-plat" "over-regul" "2.1"
[1423] "dougla" "knight" "glorif"
[1426] "dunfermlin" "toff" "off-peak"
[1429] "mask" "hard-head" "lakeland"
[1432] "tenfold" "chocol" "hayden"
[1435] "realm" "ma" "rossendal"
[1438] "afflict" "lieuten" "derek"
[1441] "malon" "nut" "punit"
[1444] "pre-releas" "holm" "lifeblood"
[1447] "82" "working-ag" "asbestos-rel"
[1450] "clamour" "roman" "charli"
[1453] "abstain" "mcnulti" "counterfeit"
[1456] "fring" "restat" "posthum"
[1459] "90th" "haltempric" "howden"
[1462] "norburi" "108" "chichest"
[1465] "petersburg" "3,600" "freer"
[1468] "visual" "hezbollah" "bombard"
[1471] "lebanes" "charlton" "darren"
[1474] "grower" "blyth" "drc"
[1477] "para" "cutback" "letwin"
[1480] "boiler" "hinkley" "mcdonald"
[1483] "cotswold" "burmes" "terri"
[1486] "polli" "toynbe" "bashir"
[1489] "orient" "centre-right" "pmqs"
[1492] "grenadi" "shalit" "mohammad"
[1495] "hemel" "hempstead" "fulham"
[1498] "kingsman" "provinci" "albion"
[1501] "collud" "habitat" "angela"
[1504] "cd" "bulb" "icon"
[1507] "luke" "resettl" "diploma"
[1510] "liam" "blog" "uninsur"
[1513] "wendi" "sailor" "ace"
[1516] "bromwich" "isc" "dick"
[1519] "guardsman" "thompson" "marx"
[1522] "dunn" "boon" "breastfeed"
[1525] "architectur" "daniel" "barrag"
[1528] "muppet" "spatial" "samantha"
[1531] "choke" "rigbi" "drummer"
[1534] "headley" "quash" "inter-faith"
[1537] "laps" "ryan" "rifleman"
[1540] "polyclin" "shopkeep" "danni"
[1543] "grit" "trump" "drainag"
[1546] "eye-wat" "20p" "temper"
[1549] "gould" "feed-in" "wootton"
[1552] "bassett" "malloch-brown" "ex-servicemen"
[1555] "annapoli" "1944" "cabl"
[1558] "upland" "mendelsohn" "zac"
[1561] "sub-prim" "draught-proof" "goldman"
[1564] "byron" "rainfal" "davo"
[1567] "3,700" "investigatori" "olmert"
[1570] "eco-town" "bastion" "suffragett"
[1573] "jessica" "spitfir" "castro"
[1576] "bag" "fyld" "abli"
[1579] "abstent" "290" "realloc"
[1582] "pillar" "embryolog" "embryo"
[1585] "thornton" "5.3" "pluck"
[1588] "calman" "man-mad" "rangoon"
[1591] "ki-moon" "asean" "remitt"
[1594] "adrian" "counter-insurg" "templ"
[1597] "tournament" "complicit" "proscript"
[1600] "gp-led" "woefulli" "anti-busi"
[1603] "medium-term" "smes" "bubbl"
[1606] "debit" "1976" "barack"
[1609] "estonia" "parkinson" "dodd"
[1612] "kagam" "dinosaur" "healey"
[1615] "tug" "improvis" "aaron"
[1618] "serjeant" "deferr" "short-tim"
[1621] "overcharg" "212" "nuneaton"
[1624] "kpmg" "1st" "quantit"
[1627] "sapper" "azimkar" "clair"
[1630] "speedo" "state-sponsor" "rio"
[1633] "tinto" "archiv" "3.9"
[1636] "5.2" "swine" "antivir"
[1639] "gurung" "kumar" "taxpayer-fund"
[1642] "westfield" "pressuris" "scrappag"
[1645] "65th" "boardroom" "cruis"
[1648] "2011-12" "panther" "merlin"
[1651] "hessl" "gesticul" "skelmersdal"
[1654] "ipsa" "northumbria" "samuel"
[1657] "louisa" "materialis" "aldershot"
[1660] "3rd" "rayner" "anglian"
[1663] "pyestock" "fairi" "4.9"
[1666] "it-" "haiti" "cadburi"
[1669] "kraft" "publican" "auschwitz-birkenau"
[1672] "calamit" "behead" "aircraftman"
[1675] "ashok" "ah" "redraw"
[1678] "bromsgrov" "said-" "mercian"
[1681] "reservist" "post-traumat" "pod"
[1684] "newburi" "rag" "550"
[1687] "whistl" "lantern" "presbyterian"
[1690] "backbench" "lash" "tornado"
[1693] "chip" "shoreditch" "bros"
[1696] "business-friend" "per-pupil" "movemb"
[1699] "one-year" "perth" "tottenham"
[1702] "one-in" "one-out" "back-to-work"
[1705] "bangor" "av" "uncontrol"
[1708] "a11" "internship" "benghazi"
[1711] "entrepreneuri" "flatlin" "allot"
[1714] "wouldn" "bend" "torch"
[1717] "lace" "porta" "mend"
[1720] "conroy" "under-occup" "wellb"
[1723] "re-hir" "#8211" "tebbutt"
[1726] "werritti" "kinsella" "firewal"
[1729] "contagion" "drug-driv" "179"
[1732] "arrow" "didn" "booki"
[1735] "kingswood" "aren" "hunter-kil"
[1738] "hegarti" "cerebr" "palsi"
[1741] "wade" "5.30" "lotus"
[1744] "dodger" "warsi" "allan"
[1747] "snooper" "darwen" "sas"
[1750] "higg" "googl" "butch"
[1753] "nicola" "savil" "2030"
[1756] "107,000" "in-work" "dementia-friend"
[1759] "clasp" "horsemeat" "ever-clos"
[1762] "jihadist" "clifton" "decarbonis"
[1765] "amritsar" "furnitur" "13p"
[1768] "wigton" "woolwich" "pluralist"
[1771] "rouhani" "elvi" "tyrel"
[1774] "crawl" "a14" "farag"
[1777] "asghar" "parrett" "dunlop"
[1780] "penzanc" "porn" "hinchingbrook"
[1783] "defibril" "manston" "juncker"
[1786] "gus" "donnel" "klesch"
[1789] "skipton"
govern.get.make.peopl.invest.spend.go (2%)
Let me explain what is happening in relation to the allowances. I apologise at the outset because some of it is complicated, and this is as I understand it. At the present time I am trying to give the explanation, if the House would be kind enough to listen. At the moment, for the Navy and Royal Marines, two different allowances have been amalgamated. One of those allowances the longer service at sea bonus is then split into two different types of payment. When all of it is amalgamated into one allowance, which is going to be called the longer separation allowance, the amount of credits under that particular part of the longer service at sea bonus will be deemed to be at roughly 60 per cent. That will mean that within that bonus there are those people who have accrued more than 60 per cent. who may receive less than they otherwise would. Will the House listen? However, that is more than compensated for by the fact that the new allowance is going to be paid at a bigger higher rate 25 rather than 12.80 and all personnel will be credited with an extra 100 days as the deemed separation. As a result of that, so I am informed, the letter that the Second Sea Lord sent to the Navy and Marines is correct people will not lose under that benefit. I am sorry, but this is the explanation. I spent a long time this morning trying to get to grips with this. In relation to the other allowance, the accumulated turbulence allowance, I am told that at present it kicks in when 280 days are served. That is now going to be amalgamated so that there is the one longer separation allowance. I am told that it is possible that some of those who are getting that allowance at present may receive less than they otherwise thought they would. However, the majority of them will receive more under the longer separation allowance. Quite apart from all of that, however, the new operational allowance tax free at 2,200 a year means that overall no one loses money and everyone gains money.
Other topics in document:
battalion.corpor.royal.regiment.die.lanc.1st (2%)
right.vaccin.govern.can.mmr.go.conserv (2%)
can.prime.make.servic.peopl.friend.countri (2%)
hon.govern.minist.parti.polici.go.make (2%)
While we are on the subject of former Home Secretaries, let me tell the right hon. Gentleman and I think this is why my right hon. Friends were shouting at him that the Home Secretary who began the process of removing embarkation controls was his predecessor as leader of the Conservative party, the right hon. and learned Member for Folkestone and Hythe (Mr. Howard) . I do not think that that was a very wise point to make. As for the Home Office, as a result of the changes that have been made crime is down, there are record numbers of police, asylum claims are now processed far more quickly, and we have reduced the number of asylum claims dramatically over the past few years. I agree that it is important to establish whether we can go further. I merely say to the right hon. Gentleman and his colleagues that I hope he will not do what he has done before, which is to attack us in public for not being toff tough enough or even toff enough! Whether toff or tough, the fact is I do not think we should get involved in a competition in that regard. Whatever the right hon. Gentleman says in public about our policy on law and order when he tries to suggest that we have not been tough enough what he actually does in the House and, in particular, with his colleagues in the other place is vote against each and every measure that is necessary. Incidentally, while we are on that subject, the right hon. Gentleman should correct something else that he did last week. He tried to suggest that the Sentencing Guidelines Council was the reason why he voted against the Criminal Justice Act 2003. Actually, he was in favour of the Sentencing Guidelines Council, and his party’s spokesman at the time said that it was admirable . The reason why Conservative Members voted against the 2003 Act was the issue of withdrawal of trial by jury on which, incidentally, they were also wrong. It was not because the measures were too soft; it was because they were too tough.
Other topics in document:
hon.govern.minist.parti.polici.go.make (2%)
right.vaccin.govern.can.mmr.go.conserv (2%)
hon.friend.right.work.new.year.import (2%)
govern.friend.take.minist.unit.servic.ireland (2%)
My hon. Friend is right in saying that it is important that we do everything that we can to achieve that second UN resolution. To be frank, many people thought that we might be in action even now, but we are not. We have delayed precisely in order to try to bring the international community back round the position that we set out in 1441. I go back to that the whole time. It was at the heart of the agreement that the United States take the multilateral path of the United Nations. The agreement was very simple. The United States had to agree to go through the United Nations, and to resolve this through the United Nations, but the other partners inside the United Nations agreed that, if Saddam did not fully comply and was in material breach, serious consequences and action would follow. The fact is, he has not complied. Four and a half months on indeed, 12 years on he has not complied. That it why it is important that we bring this issue to a head now and get it resolved. I remain, as I say, working flat out to get it resolved through the United Nations. That is easily the best thing. It will be a tragedy for the UN, when faced with this challenge, if it fails to meet it. However, we have to ensure that the unified will, as set out in 1441, is implemented.
Other topics in document:
hon.friend.get.work.agre.new.believ (2%)
right.can.friend.hon.servic.us.say (2%)
right.us.union.said.can.parti.last (2%)
Advantages
Disadvantages
Policy problem: Performance in standardised tests is strongly correlated with income, creating the potential for bias against lower-income students.
Research question: Are other components of admission files – such as written essays – less correlated with income than SATs?
Research Design:
Conclusions
Topical content strongly predicts household income
Topical content strongly predicts SAT scores
Even conditional on income, topics predict SAT scores
“Our results strongly suggest that the imprint of social class will be found in even the fuzziest of application materials.”
LDA can be embedded in more complicated models, embodying further intuitions about the structure of the texts.
The data generating distribution can be changed. We can apply mixed-membership assumptions to many kinds of data.
The posterior can be used in creative ways.
Correlated Topic Model (CTM)
Dynamic Topic Model (DTM)
Structural Topic Model (STM)
Typically, when estimating topic models we are interested in how some covariate is associated with the prevalence of topic usage (gender, date, political party, etc.)
The Structural Topic Model (STM) allows for the inclusion of arbitrary covariates of interest into the generative model
Topic prevalence is allowed to vary according to the covariates \(X\)
Topical content can also vary according to the covariates \(Y\)
LDA: Every document draws its topic mixture from the same prior
Topic prevalence model:
Topical content model:
Specify a linear model with:
\[ \theta_{dk} = \alpha + \gamma_{1k}*\text{labour}_{d(i)} \]
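The fitted object used in the output below still needs to be created. A minimal sketch, assuming \(K = 30\) (thirty topics appear below), a party.reduced document variable, and an arbitrary seed:

library(stm)
## Convert the dfm and fit an STM with party as a prevalence covariate
pmq_stm <- convert(pmq_dfm, to = "stm")
stmOut <- stm(documents = pmq_stm$documents,
              vocab = pmq_stm$vocab,
              data = pmq_stm$meta,
              prevalence = ~ party.reduced,
              K = 30, seed = 123)
## labelTopics() prints the Highest Prob / FREX / Lift / Score summaries
labelTopics(stmOut)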
Topic 1 Top Words:
Highest Prob: minist, prime, govern, s, tell, confirm, ask
FREX: prime, minist, confirm, failur, paymast, lack, embarrass
Lift: protectionist, roadshow, harrison, booki, arrog, googl, pembrokeshir
Score: prime, minist, s, confirm, protectionist, govern, tell
Topic 2 Top Words:
Highest Prob: chang, review, made, target, fund, meet, depart
FREX: climat, flood, review, chang, environ, emiss, carbon
Lift: 2050, consequenti, parrett, dredg, climat, greenhous, barnett
Score: chang, flood, climat, review, target, environ, emiss
Topic 3 Top Words:
Highest Prob: servic, health, nhs, care, hospit, nation, wait
FREX: cancer, patient, nhs, health, hospit, gp, doctor
Lift: horton, scotsman, wellb, clinician, herceptin, polyclin, healthcar
Score: health, nhs, servic, hospit, cancer, patient, nurs
Topic 4 Top Words:
Highest Prob: decis, vote, made, parti, elect, propos, debat
FREX: vote, liber, debat, scottish, decis, recommend, scotland
Lift: calman, gould, imc, wakeham, in-built, ipsa, jenkin
Score: vote, democrat, decis, parti, debat, liber, elect
Topic 5 Top Words:
Highest Prob: secretari, said, state, last, week, inquiri, report
FREX: deputi, warn, resign, inquiri, alleg, statement, servant
Lift: donnel, gus, revolutionari, sixsmith, column, bend, coulson
Score: secretari, deputi, inquiri, committe, said, state, alleg
Topic 6 Top Words:
Highest Prob: northern, ireland, meet, agreement, process, talk, peopl
FREX: ireland, northern, agreement, ira, down, sinn, decommiss
Lift: clamour, haass, tibetan, dalai, lama, tibet, presbyterian
Score: ireland, northern, agreement, peac, meet, process, down
Topic 7 Top Words:
Highest Prob: hous, home, build, need, common, plan, social
FREX: rent, hous, afford, properti, buy, lesson, site
Lift: fairi, rung, rent, greenfield, owner-occupi, bed-and-breakfast, tenant
Score: hous, home, build, rent, afford, common, fairi
Topic 8 Top Words:
Highest Prob: year, offic, polic, last, month, ago, two
FREX: four, three, ago, promis, month, five, six
Lift: templ, eye-catch, dixon, folder, paperwork, cutback, mug
Score: year, polic, crime, offic, figur, promis, month
Topic 9 Top Words:
Highest Prob: countri, world, peopl, forc, troop, afghanistan, aid
FREX: africa, taliban, zimbabw, aid, afghan, troop, g8
Lift: mbeki, madrassah, mandela, shi'a, thabo, non-agricultur, zimbabwean
Score: afghanistan, troop, iraq, iraqi, aid, afghan, africa
Topic 10 Top Words:
Highest Prob: bank, busi, energi, price, action, financi, take
FREX: price, bank, lend, energi, market, regul, financi
Lift: contagion, lender, recapitalis, depositor, okay, payday, ofgem
Score: bank, energi, price, busi, market, regul, okay
Topic 11 Top Words:
Highest Prob: school, educ, children, univers, parent, student, teacher
FREX: pupil, student, school, teacher, educ, fee, univers
Lift: 11-plus, grant-maintain, meal, numeraci, underachiev, learner, per-pupil
Score: school, educ, children, univers, teacher, student, pupil
Topic 12 Top Words:
Highest Prob: hon, right, friend, member, agre, may, mr
FREX: member, friend, right, hon, york, witney, richmond
Lift: dorri, dewar, cowdenbeath, kirkcaldi, nadin, hain, neath
Score: friend, hon, right, member, mr, dorri, agre
Topic 13 Top Words:
Highest Prob: per, cent, 20, 10, 50, increas, 100
FREX: cent, per, 50, 15, 20, 60, 40
Lift: unrealist, slaughtermen, ppp, cent, outbreak, per, maff
Score: per, cent, 20, 50, unrealist, 15, billion
Topic 14 Top Words:
Highest Prob: mr, money, word, taxpay, speaker, public, much
FREX: speaker, mail, strike, taxpay, gold, valu, blair
Lift: davo, measl, jab, trussel, brightsid, mail, spiv
Score: mr, speaker, taxpay, davo, word, strike, mail
Topic 15 Top Words:
Highest Prob: number, increas, result, peopl, train, year, addit
FREX: number, train, overal, recruit, 1997, equip, increas
Lift: stubborn, ta, midwiferi, largest-ev, 180,000, dentist, improvis
Score: number, increas, train, invest, 1997, equip, defenc
Topic 16 Top Words:
Highest Prob: pension, benefit, peopl, help, work, million, poverti
FREX: disabl, pension, post, poverti, benefit, payment, retir
Lift: adair, eyesight, off-peak, sub-post, sub-postmast, over-75, concessionari
Score: pension, benefit, disabl, post, poverti, child, welfar
Topic 17 Top Words:
Highest Prob: law, act, legisl, crime, prison, peopl, measur
FREX: prison, asylum, crimin, releas, deport, offenc, law
Lift: conduc, investigatori, porn, indetermin, parol, pre-releas, deport
Score: prison, crime, crimin, sentenc, law, asylum, drug
Topic 18 Top Words:
Highest Prob: conserv, govern, parti, spend, polici, money, gentleman
FREX: conserv, spend, oppos, tori, parti, previous, polici
Lift: snooper, attle, bawl, saatchi, family-friend, tori, chef
Score: conserv, spend, parti, oppos, money, cut, billion
Topic 19 Top Words:
Highest Prob: european, union, britain, europ, british, countri, rule
FREX: european, europ, treati, currenc, eu, union, constitut
Lift: overtaken, super-st, tidying-up, super-pow, lafontain, isc, lisbon
Score: european, union, europ, referendum, treati, constitut, britain
Topic 20 Top Words:
Highest Prob: unit, kingdom, iraq, state, nation, weapon, secur
FREX: palestinian, weapon, resolut, israel, destruct, kingdom, mass
Lift: 1441, palestinian, two-stat, chess, hama, jenin, quartet
Score: unit, un, iraq, weapon, saddam, kingdom, palestinian
Topic 21 Top Words:
Highest Prob: constitu, concern, awar, can, suffer, case, assur
FREX: mother, miner, mrs, compens, suffer, aircraft, mine
Lift: manston, tebbutt, tyrel, asbestosi, byron, ex-min, norburi
Score: constitu, compens, suffer, death, awar, tebbutt, safeti
Topic 22 Top Words:
Highest Prob: join, famili, tribut, pay, express, live, serv
FREX: condol, sympathi, regiment, tribut, sacrific, veteran, servicemen
Lift: aaron, chant, guardsman, gurung, khabra, mercian, spitfir
Score: tribut, condol, join, afghanistan, famili, express, sympathi
Topic 23 Top Words:
Highest Prob: make, issu, hon, import, gentleman, look, can
FREX: issu, proper, look, obvious, certain, understand, point
Lift: canvass, launder, quasi-judici, biodivers, offhand, obvious, certain
Score: issu, gentleman, import, hon, point, make, look
Topic 24 Top Words:
Highest Prob: invest, london, region, transport, develop, constitu, project
FREX: project, rail, scienc, infrastructur, transport, research, north
Lift: duall, electrifi, skelmersdal, wigton, dawlish, electrif, stoneheng
Score: invest, transport, region, rail, scienc, project, infrastructur
Topic 25 Top Words:
Highest Prob: local, communiti, council, author, support, polic, peopl
FREX: behaviour, antisoci, local, counti, club, footbal, author
Lift: blyth, changemak, asbo, graffiti, csos, darwen, under-16
Score: local, communiti, antisoci, behaviour, council, author, polic
Topic 26 Top Words:
Highest Prob: job, work, peopl, unemploy, economi, busi, help
FREX: unemploy, employ, growth, sector, long-term, apprenticeship, creat
Lift: skipton, entrepreneuri, sector-l, sandwich, back-to-work, entrepreneur, unemploy
Score: unemploy, job, economi, sector, employ, growth, busi
Topic 27 Top Words:
Highest Prob: say, let, said, want, labour, go, question
FREX: answer, question, let, got, shadow, listen, wrong
Lift: wriggl, airbrush, mccluskey, re-hir, pre-script, beveridg, bandwagon
Score: labour, answer, question, let, gentleman, said, say
Topic 28 Top Words:
Highest Prob: tax, pay, cut, budget, famili, rate, peopl
FREX: tax, vat, low, budget, top, revenu, incom
Lift: flatter, 45p, non-domicil, 107,000, 50p, clifton, millionair
Score: tax, cut, pay, incom, rate, famili, budget
Topic 29 Top Words:
Highest Prob: industri, compani, worker, british, manufactur, job, trade
FREX: manufactur, industri, product, plant, steel, worker, car
Lift: alstom, gum, jcb, peugeot, klesch, chew, dairi
Score: industri, manufactur, compani, worker, export, farmer, farm
Topic 30 Top Words:
Highest Prob: govern, can, mani, support, peopl, countri, take
FREX: mani, support, govern, unlik, give, come, take
Lift: philip, unlik, unbeliev, mani, though, leav, despit
Score: philip, govern, mani, unlik, support, can, peopl
- Highest Prob: the raw \(\beta\) values (the word probabilities for a given topic, which are allowed to vary according to the covariates)
- FREX: a measure which combines word-topic frequency with word-topic exclusivity
- Lift: a normalised version of the word probabilities
- Score: the term-score measure we defined above
Topic 3:
I suspect that many Members from all parties in this House will agree that mental health services have for too long been treated as a poor cousin a Cinderella service in the NHS and have been systematically underfunded for a long time. That is why I am delighted to say that the coalition Government have announced that we will be introducing new access and waiting time standards for mental health conditions such as have been in existence for physical health conditions for a long time. Over time, as reflected in the new NHS mandate, we must ensure that mental health is treated with equality of resources and esteem compared with any other part of the NHS.
I am sure that the Prime Minister will join me in congratulating Cheltenham and Tewkesbury primary care trust on never having had a financial deficit and on living within its means. Can he therefore explain to the professionals, patients and people of Cheltenham why we are being rewarded with the closure of our 10-year-old purpose-built maternity ward, the closure of our rehabilitation hospital, cuts in health promotion, cuts in community nursing, cuts in health visiting, cuts in access to acute care and the non-implementation of new NICE-prescribed drugs such as Herceptin?
Do MPs from different parties speak about healthcare at different rates?
# Regress the prevalence of Topic 3 on party affiliation,
# using the STM we already estimated and pulling party labels
# from the document-level metadata
stm_effects <- estimateEffect(formula = c(3) ~ party.reduced,
                              stmobj = stmOut,
                              metadata = docvars(pmq_dfm))

plot.estimateEffect(stm_effects,
                    covariate = "party.reduced",
                    method = "pointestimate",
                    xlim = c(0.025, 0.045))

For each document \(d\):
\(\theta_{d,3} = \text{expected share of words about healthcare}\)
STM estimates: How does the expected proportion of T3-related words change by party?
Who devotes a larger proportion of their speech to T3?
On which topics do Conservative and Labour MPs differ the most?
Do liberal and conservative newspapers report on the economy in different ways?
Lucy Barnes and Tim Hicks (UCL) study the determinants of voters’ attitudes toward government deficits. They argue that individual attitudes are largely a function of media framing. They examine whether and how the Guardian (a left-leaning) and the Telegraph (a right-leaning) report on the economy.
Data and approach:
\(\approx 10,000\) newspaper articles
STM with newspaper covariates: which economic topics appear (prevalence), and how those topics are framed linguistically (content)
\(K = 6\)
Newspaper covariates for both prevalence and content
LDA, and topic models more generally, require the researcher to make several implementation decisions
In particular, we must select a value for \(K\), the number of topics
How can we select between different values of K?
How can we tell how well a given topic model is performing?
Predictive metric: Held-out likelihood
Ask which words the model believes will appear in a given document and compare this to the document’s actual word composition (i.e. calculate the held-out likelihood)
E.g., split the texts in half, train a topic model on one half, and calculate the held-out likelihood of the other half
Problem: Prediction is not always important in exploratory or descriptive tasks. We may want models that capture other aspects of the data.
Interpretational metrics
Semantic coherence
Exclusivity
Problem: The correlation between quantitative diagnostics such as these and human judgements of topic coherence is not always positive!
We can apply many of these metrics across a range of topic models using the searchK function in the stm package.
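A minimal sketch of such a comparison (the candidate values of K and the reuse of the stm-format objects from above are assumptions):

# Compare candidate numbers of topics on held-out likelihood,
# semantic coherence, exclusivity, and residuals
k_search <- searchK(documents = pmq_stm$documents,
                    vocab = pmq_stm$vocab,
                    K = c(10, 20, 30, 40),
                    data = pmq_stm$meta)
plot(k_search)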
Word intrusion: Test whether topics have semantic coherence by asking humans to identify a spurious word inserted into a topic.
| Topic | \(w_1\) | \(w_2\) | \(w_3\) | \(w_4\) | \(w_5\) | \(w_6\) |
|---|---|---|---|---|---|---|
| 1 | bank | financ | terror | england | fiscal | market |
| 2 | europe | union | eu | referendum | vote | school |
| 3 | act | deliv | nhs | prison | mr | right |
Assumption: When humans find it easy to locate the “intruding” word, the topics are more coherent.
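A sketch of how one might construct a word-intrusion item from an estimated beta matrix (assumed to be K × J with vocabulary words as column names; the selection rule for the intruder is one simple choice among many):

make_intrusion_item <- function(beta, k, n_top = 5) {
  vocab <- colnames(beta)
  # the topic's own top words
  top_words <- vocab[order(beta[k, ], decreasing = TRUE)[1:n_top]]
  # intruder: a word prominent in some other topic but rare in topic k
  intruder_ix <- which.max(apply(beta[-k, , drop = FALSE], 2, max) - beta[k, ])
  # shuffle so the intruder's position gives nothing away
  sample(c(top_words, vocab[intruder_ix]))
}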
Topic intrusion: Test if the association between topics and documents makes sense by asking humans to identify a topic that was not associated with a document.
Reforms to the banking system are an essential part of dealing with the crisis, and delivering lasting and sustainable growth to the economy. Without these changes, we will be weaker, we will be less well respected abroad, and we will be poorer.
| Topic | \(w_1\) | \(w_2\) | \(w_3\) | \(w_4\) | \(w_5\) | \(w_6\) |
|---|---|---|---|---|---|---|
| 1 | bank | financ | regul | england | fiscal | market |
| 2 | plan | econom | growth | longterm | deliv | sector |
| 3 | school | educ | children | teacher | pupil | class |
Assumption: When humans find it easy to locate the “intruding” topic, the mappings are more sensible.
Conclusion:
“Topic models which perform better on held-out likelihood may infer less semantically meaningful topics.” (Chang et al., 2009)
Semantic validity
Predictive validity
Construct validity
Implication: All these approaches require careful human reading of texts and topics, and comparison with sensible metadata.
Topic models offer an approach to automatically inferring the substantive themes that exist in a corpus of texts
A topic is described as a probability distribution over words in the vocabulary
Documents are described as a mixture of corpus-wide topics
Topic models require very little up-front effort, but require extensive interpretation and validation
PUBL0099