Let’s take a look at a US Senate debate on partial-birth abortion. As always, we’ll load the texts, make a corpus, and then a document term matrix to start off.
library(tidyverse)
library(quanteda)
library(quanteda.textmodels)
library(quanteda.textstats)
library(ggrepel) # for labelling ggplots
In this exercise we’ll continue to look at the US Debate, but this time from a scaling perspective. If these speakers were expressing their positions on a latent dimension using words, where would they all be?
We’ll use the pre-constructed data frame of speaker contributions and make a corpus out of it. Because we’re interested in speaker ideal points, we’ll assume that all of a speaker’s contributions to the debate express the same position, so it’s reasonable to bundle them all into one document, i.e. documents are speakers.
load("data/corpus_us_debate_speaker.rda")
summary(corpus_us_debate_speaker)
## Corpus consisting of 23 documents, showing 23 documents:
##
## Text Types Tokens Sentences party speaker
## ALLARD 400 1165 53 R ALLARD
## BOND 129 232 9 R BOND
## BOXER 2231 18527 886 D BOXER
## BROWNBACK 646 2884 168 R BROWNBACK
## BUNNING 281 593 32 R BUNNING
## CANTWELL 395 1114 55 D CANTWELL
## DeWINE 463 1438 75 R DeWINE
## DOMENICI 203 479 27 R DOMENICI
## DURBIN 520 1874 89 D DURBIN
## ENSIGN 410 1235 66 R ENSIGN
## FEINGOLD 213 414 19 D FEINGOLD
## FEINSTEIN 1348 6112 284 D FEINSTEIN
## FRIST 222 501 25 R FRIST
## HARKIN 655 2612 145 D HARKIN
## HATCH 451 1173 60 R HATCH
## LAUTENBERG 678 2131 123 D LAUTENBERG
## MIKULSKI 260 631 42 D MIKULSKI
## NELSO 155 328 17 D NELSO
## NICKLES 280 568 26 R NICKLES
## REID 38 59 4 D REID
## SANTORUM 1593 10507 458 R SANTORUM
## SESSIONS 653 2174 111 R SESSIONS
## VOINOVICH 186 363 19 R VOINOVICH
corp <- corpus_us_debate_speaker # give it a shorter name!
Now we’ll make a document term matrix.
# start from the default stopword list, but keep 6 gendered pronouns
# (so they stay in the texts) and add "bill" and "can" as extra stopwords
stops <- c(setdiff(stopwords(),
c("her", "she", "hers", "he", "his", "him")),
"bill", "can")
toks <- tokens(corp,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE) |>
tokens_remove(stops) |>
tokens_tolower()
corpdfm <- dfm(toks)
Previously we took care to remove a lot of things we thought would be confusing.
Here let’s live dangerously and use all the words. (The models being fitted are a lot more constrained than before.) You should check that these decisions don’t affect the results much; a sketch of one such check follows the speaker plot below.
We’ll run a simple one-dimensional scaling model, known as ‘wordfish’ to political scientists, and as an association model to everyone else.
wfish <- textmodel_wordfish(corpdfm, dir = c(3, 21))
summary(wfish)
##
## Call:
## textmodel_wordfish.dfm(x = corpdfm, dir = c(3, 21))
##
## Estimated Document Positions:
## theta se
## ALLARD 0.02165 0.06304
## BOND -0.08693 0.12951
## BOXER -1.05777 0.01098
## BROWNBACK 2.38270 0.05180
## BUNNING 0.81777 0.09908
## CANTWELL -1.25635 0.03847
## DeWINE 1.38729 0.06900
## DOMENICI 0.32865 0.10463
## DURBIN -0.25970 0.04686
## ENSIGN 0.83659 0.07612
## FEINGOLD -1.20438 0.06369
## FEINSTEIN -1.59354 0.01318
## FRIST 0.83881 0.11169
## HARKIN -0.79095 0.03345
## HATCH 0.97740 0.07396
## LAUTENBERG -0.65712 0.03848
## MIKULSKI -1.22626 0.05053
## NELSO -1.10229 0.07946
## NICKLES 0.13604 0.08712
## REID 0.20834 0.27792
## SANTORUM 0.48882 0.02367
## SESSIONS 0.62232 0.05214
## VOINOVICH 0.18892 0.11350
##
## Estimated Feature Scores:
## mr president commend senator pennsylvania santorum frist
## beta 0.06468 0.05462 -0.02596 0.06741 -0.1341 0.1254 0.4459
## psi 0.87009 1.50382 -2.04487 1.85843 -0.6193 -0.4688 -1.7535
## leadership particular issue worked extremely hard also presiding
## beta 0.01234 -0.3105 0.2177 0.1999 -0.04553 0.3747 -0.1087 -0.2618
## psi -0.93473 -0.3830 1.0103 -3.1011 -1.76368 -2.0193 0.6330 -2.1476
## officer his rights unborn pleased cosponsor partial-birth abortion
## beta -0.2618 0.5470 0.3719 0.3750 -0.06108 0.003728 0.2335 -0.1562
## psi -2.1476 0.9393 0.8146 -0.6331 -1.20952 -1.748114 1.6869 2.7850
## act s legislation designed help protect unnecessary
## beta -0.4283 -0.8522 -0.5748 -0.07312 0.5454 -0.2535 -0.2491
## psi 0.4152 -0.1480 1.1859 -0.67491 -0.8843 0.7095 -0.8415
The dir argument sets the meaning of left and right. We are going to assert that the 3rd speaker, Barbara Boxer, is to the left of the 21st speaker, Rick Santorum. We don’t make any claims about how far to the left, or about any of the other speakers. Why do we do this?
The model we are fitting says that each speaker-word association is modelled by the product of a speaker position \(\theta\) and a word position \(\beta\). So if both are positive or both are negative, we expect the speaker to use that word more. However, we don’t fix either the \(\theta\) or the \(\beta\) values, so there are always two models that fit identically: the one with \(\theta_i\beta_j\) and the one with \((-\theta_i)(-\beta_j)\). The second sorts the speakers the opposite way to the first, but we can’t tell, because it also sorts the word positions the other way to compensate. Since we’d like the results to be interpretable in substantive terms, we can name a left and a right speaker to orient the model, without adding any real information to it.
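For reference, the full specification (the standard ‘wordfish’ model, whose \(\alpha\) and \(\psi\) parameters you can see in the coef output below) is

\[
W_{ij} \sim \text{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \alpha_i + \psi_j + \theta_i \beta_j,
\]

where \(W_{ij}\) is the count of word \(j\) in speaker \(i\)’s contributions, \(\alpha_i\) captures how much speaker \(i\) talks overall, and \(\psi_j\) captures how common word \(j\) is overall. The identification problem lives entirely in the product \(\theta_i\beta_j\).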
It doesn’t really matter whether you do this step or not; we could always do a bit of multiplication after the model is fit. Just don’t be surprised when people you expect on one side end up on the other. For non-ideological contexts, i.e. when the dimension doesn’t have a natural direction, this whole thing matters even less.
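Here is a minimal sketch of that bit of multiplication, using the wfish object fitted above (the theta slot is accessed the same way as the beta and psi slots we use later):
# negating both theta and beta leaves the fit unchanged; it only
# relabels which end of the dimension counts as 'left' and 'right'
flipped <- data.frame(speaker = docnames(corpdfm),
                      theta = -wfish$theta,
                      original_theta = wfish$theta)
head(flipped)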
The fitted model contains all the components you saw in the lectures, organized in two blocks
wf_coef <- coef(wfish)
names(wf_coef)
## [1] "documents" "features"
head(wf_coef$documents) # speaker parameters
## theta alpha
## ALLARD 0.02165274 0.21348639
## BOND -0.08693357 -1.29227632
## BOXER -1.05777079 2.76206193
## BROWNBACK 2.38269597 0.12975044
## BUNNING 0.81776695 -0.48868033
## CANTWELL -1.25634814 -0.07911048
head(wf_coef$features) # word parameters
## beta psi
## mr 0.06468449 0.8700898
## president 0.05462191 1.5038171
## commend -0.02595502 -2.0448676
## senator 0.06740927 1.8584309
## pennsylvania -0.13413832 -0.6192956
## santorum 0.12538330 -0.4688482
The estimated speaker positions can be extracted as if this were a regular fitted regression model. It’s nice to get some uncertainty around these positions too
preds <- predict(wfish, interval = "confidence")
preds
## $fit
## fit lwr upr
## ALLARD 0.02165274 -0.10190180 0.1452073
## BOND -0.08693357 -0.34076934 0.1669022
## BOXER -1.05777079 -1.07929641 -1.0362452
## BROWNBACK 2.38269597 2.28117382 2.4842181
## BUNNING 0.81776695 0.62356627 1.0119676
## CANTWELL -1.25634814 -1.33174776 -1.1809485
## DeWINE 1.38728909 1.25204737 1.5225308
## DOMENICI 0.32865152 0.12357508 0.5337280
## DURBIN -0.25969664 -0.35154658 -0.1678467
## ENSIGN 0.83658945 0.68738863 0.9857903
## FEINGOLD -1.20437761 -1.32919832 -1.0795569
## FEINSTEIN -1.59354288 -1.61938317 -1.5677026
## FRIST 0.83880779 0.61988997 1.0577256
## HARKIN -0.79094571 -0.85650252 -0.7253889
## HATCH 0.97739715 0.83244576 1.1223485
## LAUTENBERG -0.65712119 -0.73253786 -0.5817045
## MIKULSKI -1.22625771 -1.32530022 -1.1272152
## NELSO -1.10229306 -1.25803199 -0.9465541
## NICKLES 0.13603646 -0.03471560 0.3067885
## REID 0.20834388 -0.33636489 0.7530527
## SANTORUM 0.48881739 0.44242432 0.5352105
## SESSIONS 0.62231611 0.52012750 0.7245047
## VOINOVICH 0.18892279 -0.03352536 0.4113709
As in the topic model case, we need to work a bit to extract the (only) element of preds, called ‘fit’, to get an actual table of positions and associated uncertainty. When we’ve dug it out, we’ll attach the docvars to keep everything together.
# make a data frame from fit matrix and the document variables
ests <- data.frame(preds$fit)
speaker_pos <- data.frame(docvars(corpdfm), ests) |>
arrange(fit) # sort left (negative) to right (positive)
head(speaker_pos)
## party speaker fit lwr upr
## FEINSTEIN D FEINSTEIN -1.593543 -1.619383 -1.5677026
## CANTWELL D CANTWELL -1.256348 -1.331748 -1.1809485
## MIKULSKI D MIKULSKI -1.226258 -1.325300 -1.1272152
## FEINGOLD D FEINGOLD -1.204378 -1.329198 -1.0795569
## NELSO D NELSO -1.102293 -1.258032 -0.9465541
## BOXER D BOXER -1.057771 -1.079296 -1.0362452
Let’s plot these positions. I’ll use ggplot2
N <- nrow(speaker_pos)
speaker_names <- speaker_pos$speaker
ggplot(speaker_pos, aes(x = fit, y = 1:N, col = party)) +
geom_point() +
geom_errorbarh(aes(xmin = lwr, xmax = upr), height = 0) +
scale_color_manual(values = c("blue", "red")) +
scale_y_continuous(labels = speaker_names, breaks = 1:N) +
labs(x = "Position", y = "Speaker") +
ggtitle("Estimated Positions from Senate Partial-Birth Abortion Debate",
subtitle = "October 21st, 2003")
Let’s extract the word positions for later:
word_pos <- data.frame(word = wfish$features,
position = wfish$beta,
offset = wfish$psi)
We can get some idea of the nature of the debate language by looking for words we think ought to be charged. Take a moment to make some predictions about the scores of these:
testwords <- c("life", "choice", "womb", "her", "woman", "health",
"born", "baby", "little", "gruesome", "kill", "roe",
"wade", "medical", "her", "his", "child", "religion",
"catholic", "doctor", "nurse")
Now let’s see
testscores <- word_pos |>
filter(word %in% testwords) |>
arrange(position)
testscores[,1:2] # just word and score columns
## word position
## 1 catholic -0.87235748
## 2 religion -0.87235748
## 3 health -0.67988781
## 4 woman -0.67535054
## 5 choice -0.46990160
## 6 medical -0.45263277
## 7 roe -0.43295110
## 8 wade -0.31416752
## 9 her -0.18386806
## 10 doctor -0.01346807
## 11 life 0.18469327
## 12 baby 0.48621181
## 13 born 0.50502110
## 14 his 0.54704842
## 15 little 0.57050720
## 16 kill 0.63760060
## 17 gruesome 0.66857723
## 18 child 0.68572689
## 19 nurse 0.71239098
## 20 womb 1.10874972
If we were Slapin and Proksch we might put all this on one of their Eiffel Tower diagrams…
ggplot(word_pos, aes(position, offset, label = word)) +
geom_point(color = "grey", alpha = 0.2) +
geom_text_repel(data = testscores, col = "black") +
geom_point(data = testscores) +
labs(x = "Word position", y = "Offset parameter") +
ggtitle("Estimated word positions for selected debate vocabulary",
subtitle = "Note: Offset parameter is roughly proportional to word frequency")
One thing to take away from this is that the more frequent a word (the higher it is on the y axis), the less likely it is to express an extreme position. None of our test words were particularly extreme, so you might want to take a look at those unlabelled extremes. You can do that by pulling all the word positions and sorting them in order of beta, then looking at the lowest and highest, e.g. using head or tail.
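A minimal version of that, reusing the word_pos data frame from above:
# words estimated furthest to the left and furthest to the right
word_pos |> arrange(position) |> head(10)
word_pos |> arrange(position) |> tail(10)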
If we were being thorough about these words we’d check that they do what we think they do by looking at them in all their contexts, as we did before. This one is somewhat gruesome, so I’ll let you run it yourself, or you can choose another word of interest.
kwic(tokens(corp), "baby", window = 15)
To fit a more-than-one-dimensional scaling model we need to switch models. We’ll use correspondence analysis (CA) for efficiency, which quanteda wraps up in the function textmodel_ca. And just as textmodel_lr actually called functions from {glmnet}, textmodel_ca is a light wrapper around {ca}, the venerable correspondence analysis package. Let’s take it for a spin with two dimensions.
mod2 <- textmodel_ca(corpdfm, nd = 2)
Let’s see what we get from this model
ca_coefs <- coef(mod2)
names(ca_coefs)
## [1] "coef_feature" "coef_feature_se" "coef_document" "coef_document_se"
This looks promising, until we realize that we’re only getting the first dimension from coef_document and coef_feature.
ca_coefs$coef_document
## ALLARD BOND BOXER BROWNBACK BUNNING CANTWELL
## -0.26237636 -0.18086402 0.60798719 -3.05908611 -0.90092468 0.84585647
## DeWINE DOMENICI DURBIN ENSIGN FEINGOLD FEINSTEIN
## -1.46656126 -0.47811240 -0.03301386 -0.94315418 0.77460379 1.19487689
## FRIST HARKIN HATCH LAUTENBERG MIKULSKI NELSO
## -0.95364604 0.38041003 -1.13512695 0.24535845 0.82459942 0.69317035
## NICKLES REID SANTORUM SESSIONS VOINOVICH
## -0.35322982 -0.42473104 -0.65283838 -0.76297152 -0.37723043
But since we’re here, let’s compare these positions to the ones from the association model we fitted earlier. Happily the speakers are ordered in exactly the same way (you can check that), so we can simply correlate them.
cor(ca_coefs$coef_document, ests$fit)
## [1] -0.9887326
And here you’ll see that a) they agree very closely, but b) they point in opposite directions, because we didn’t force a direction for theta and the fitting process chose the other way. Hence if we want to interpret these we should either multiply theta and beta by -1 or just practice our lateral thinking skills…
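To make the agreement easier to see, we can flip the CA scores by hand before correlating:
# negate the CA document scores so both models point the same way
cor(-ca_coefs$coef_document, ests$fit)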
The other thing that’s noticeable on inspection is that there are no measures of uncertainty around anything.
ca_coefs$coef_document_se # seems rather rude
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
These are possible to compute in CA (although not in this package), but a bit more expensive. Ask if you want to know how.
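In case you are curious, here is a minimal sketch of one common approach, a multinomial bootstrap: redraw each speaker’s word counts, refit the CA, and see how much the first-dimension positions move. The number of replicates and the resampling scheme are illustrative choices, and it uses the cacoord() coordinate extractor that we introduce properly just below.
# a multinomial bootstrap for CA position uncertainty (sketch only)
library(ca) # for cacoord(); introduced properly just below
set.seed(123)
n_boot <- 50 # illustrative; more would be better
counts <- as.matrix(corpdfm)
orig <- cacoord(mod2, dim = 1:2, rows = TRUE)[, 1]
boot_pos <- replicate(n_boot, {
  # redraw each speaker's word counts from a multinomial with their
  # observed word proportions and total token count
  resampled <- t(apply(counts, 1, function(row)
    as.vector(rmultinom(1, size = sum(row), prob = row))))
  colnames(resampled) <- colnames(counts)
  resampled <- resampled[, colSums(resampled) > 0] # drop empty features
  refit <- textmodel_ca(as.dfm(resampled), nd = 2)
  pos <- cacoord(refit, dim = 1:2, rows = TRUE)[, 1]
  if (cor(pos, orig) < 0) pos <- -pos # align the arbitrary sign
  pos
})
# bootstrap standard errors for the first-dimension speaker positions
apply(boot_pos, 1, sd)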
Let’s instead make use of the fact that the fitted model is a {ca} model and use that package’s own functions. The speakers are rows and the words are columns, so the package offers a general-purpose way to get row coordinates and column coordinates for arbitrary dimensions. We fitted two, so let’s get speaker positions in the first two dimensions.
library(ca) # this comes with quanteda.textmodels
speaker_pos2 <- cacoord(mod2, dim = 1:2, rows = TRUE)
speaker_pos2
## Dim1 Dim2
## ALLARD -0.26237636 -1.3915416
## BOND -0.18086402 -1.3675989
## BOXER 0.60798719 0.5297288
## BROWNBACK -3.05908611 2.8875633
## BUNNING -0.90092468 -1.6842824
## CANTWELL 0.84585647 0.3283115
## DeWINE -1.46656126 -1.3136832
## DOMENICI -0.47811240 -1.1858946
## DURBIN -0.03301386 0.1726151
## ENSIGN -0.94315418 -0.6092284
## FEINGOLD 0.77460379 0.1559779
## FEINSTEIN 1.19487689 0.1530177
## FRIST -0.95364604 -3.2332382
## HARKIN 0.38041003 0.3028870
## HATCH -1.13512695 -1.6493573
## LAUTENBERG 0.24535845 0.2934741
## MIKULSKI 0.82459942 0.2156859
## NELSO 0.69317035 -0.1825432
## NICKLES -0.35322982 -2.1983291
## REID -0.42473104 -0.6159025
## SANTORUM -0.65283838 -0.6556323
## SESSIONS -0.76297152 -1.3123759
## VOINOVICH -0.37723043 -1.3371385
These are, of course, better plotted. We can either do this ourselves:
sp_pos <- data.frame(speaker_pos2,
speaker = rownames(speaker_pos2)) # add the speaker as a column
ggplot(sp_pos, aes(Dim1, Dim2)) +
geom_point() +
geom_text_repel(aes(label = speaker))
or just use the rather comprehensive plot function provided by {ca}. Since there are a lot of words, we could use this function to plot the speakers only
plot(mod2, what = c("all", "none"))
(Remembering that this model has dimension 1 flipped).
If you want the words and not the speakers, change what to be c("none", "all"), and get ready to make a really busy plot… If you want both, leave out what altogether. The only downside to this function is that we’d probably like to flip the sign of at least the first axis when we explain this model to people, but there isn’t an easy way to do it. If we take back control of plotting, we can, but then it’s more work.
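Here is a minimal sketch of that ‘more work’ option, reusing the sp_pos data frame from above with the first dimension negated so that, as in the wordfish fit, the Democratic speakers end up on the left:
# the same speaker map, but with dimension 1 flipped to a more
# natural left-right orientation
ggplot(sp_pos, aes(x = -Dim1, y = Dim2)) +
  geom_point() +
  geom_text_repel(aes(label = speaker)) +
  labs(x = "Dimension 1 (sign flipped)", y = "Dimension 2")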
But already we can see from this plot that Brownback is not just extreme on the first dimension but talks quite differently from all the other speakers. What’s happening here? (Maybe we should take a look at Frist too.)
Let’s see what Brownback talks about
bb <- textstat_keyness(corpdfm, target = "BROWNBACK")
head(bb, 20)
## feature chi2 p n_target n_reference
## 1 samuel 396.36357 0.000000e+00 22 1
## 2 hand 367.35657 0.000000e+00 26 8
## 3 womb 280.15762 0.000000e+00 26 17
## 4 person 188.46923 0.000000e+00 18 12
## 5 he 151.74913 0.000000e+00 33 78
## 6 property 140.87980 0.000000e+00 9 1
## 7 child 127.78792 0.000000e+00 32 87
## 8 surgery 99.71526 0.000000e+00 11 9
## 9 utero 83.43004 0.000000e+00 6 1
## 10 surgeries 79.48003 0.000000e+00 5 0
## 11 beautiful 59.90493 9.992007e-15 4 0
## 12 photographer 59.90493 9.992007e-15 4 0
## 13 testify 59.90493 9.992007e-15 4 0
## 14 usa 59.90493 9.992007e-15 4 0
## 15 unique 46.54580 8.950174e-12 4 1
## 16 amongst 40.51390 1.952226e-10 3 0
## 17 famous 40.51390 1.952226e-10 3 0
## 18 needle 40.51390 1.952226e-10 3 0
## 19 responsibilities 40.51390 1.952226e-10 3 0
## 20 sacred 40.51390 1.952226e-10 3 0
These are the distinctive parts of his vocabulary, and we can check how they land in the second dimension
# extract word positions
word_pos2 <- cacoord(mod2, dim = 1:2, cols = TRUE)
bb_words <- data.frame(word_pos2, word = rownames(word_pos2)) |>
filter(word %in% bb$feature[1:20]) |>
arrange(Dim2)
bb_words
## Dim1 Dim2 word
## child -1.800534 0.5388316 child
## he -1.629173 1.1061740 he
## womb -3.244048 2.5033597 womb
## person -3.061756 2.8073656 person
## surgery -2.125780 3.1464641 surgery
## hand -3.920159 3.7798270 hand
## utero -4.480888 3.9808858 utero
## unique -3.689256 4.1618744 unique
## property -4.535972 4.5031753 property
## samuel -4.666632 4.9507920 samuel
## surgeries -4.923228 5.1330251 surgeries
## testify -4.923228 5.1330251 testify
## beautiful -4.923228 5.1330251 beautiful
## photographer -4.923228 5.1330251 photographer
## usa -4.923228 5.1330251 usa
## amongst -4.923228 5.1330251 amongst
## needle -4.923228 5.1330251 needle
## famous -4.923228 5.1330251 famous
## responsibilities -4.923228 5.1330251 responsibilities
## sacred -4.923228 5.1330251 sacred
After this cursory glance it seems that Brownback is his own ‘dimension’, or rather, that his distinctive speech has been interpreted as one pole of a dimension by a model whose job is to look for dimensions.
How do things look if we take him out and refit the model?
corp_nobb <- corpus_subset(corp, speaker != "BROWNBACK")
toks_nobb <- tokens(corp_nobb,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE) |>
tokens_remove(stops)
mod2_nobb <- textmodel_ca(dfm(toks_nobb), nd = 2)
Let’s do a quick comparison
pos <- data.frame(cacoord(mod2, dim = 1:2, rows = TRUE))
pos_nobb <- data.frame(cacoord(mod2_nobb, dim = 1:2, rows = TRUE))
# connect the positions from both runs of the model together
pos_both <- merge(pos, pos_nobb, by = "row.names",
suffixes = c("",".nobb"))
head(pos_both)
## Row.names Dim1 Dim2 Dim1.nobb Dim2.nobb
## 1 ALLARD -0.2623764 -1.3915416 -0.8893293 -0.9578887
## 2 BOND -0.1808640 -1.3675989 -0.7678698 -1.4052542
## 3 BOXER 0.6079872 0.5297288 0.6945660 0.5502065
## 4 BUNNING -0.9009247 -1.6842824 -1.6493941 -0.4302980
## 5 CANTWELL 0.8458565 0.3283115 0.8732419 -0.3398072
## 6 DeWINE -1.4665613 -1.3136832 -2.1651482 1.7980709
# does Brownback affect the first dimension?
cor(pos_both$Dim1, pos_both$Dim1.nobb)
## [1] 0.9775541
# how about the second?
cor(pos_both$Dim2, pos_both$Dim2.nobb)
## [1] 0.6380843
These correlations suggest that removing Brownback doesn’t change the first-dimension positions much at all (which is kind of reassuring), but does rearrange speakers on the second.
It’s quite possible that this second ‘dimension’ is in general just idiosyncratic topics that speakers like and which generate variation in word counts without being really ‘dimensional’. If that’s true then Brownback is not really a special case, and we should put him back in and stop trying to figure out what the second dimension ‘really’ is. But we’d need to dig in further than this to establish that conclusion.
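One quick way to start digging is to look at which words anchor the second dimension in the no-Brownback model and judge whether they look like coherent position-taking vocabulary or just topical clutter. A sketch, reusing the cacoord() extractor from above:
# which words anchor the second dimension once Brownback is removed?
word_pos2_nobb <- data.frame(cacoord(mod2_nobb, dim = 1:2, cols = TRUE))
word_pos2_nobb$word <- rownames(word_pos2_nobb)
head(arrange(word_pos2_nobb, Dim2)[, c("word", "Dim2")], 15) # one pole
tail(arrange(word_pos2_nobb, Dim2)[, c("word", "Dim2")], 15) # the other pole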