Handling multi-word features in R


Multi-word verbs (e.g. “set out”, “agree on” and “take off”) and names (e.g. “United Kingdom” and “New York”) are important features of texts, but they are difficult to keep in bag-of-words text analysis, because tokenizers usually split strings on spaces. You can preprocess texts to concatenate multi-word features with underscores, as in “set_out” or “United_Kingdom”, but we can also postprocess tokenized texts using the functions that we recently added to quanteda.
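The preprocessing approach can be done with base R alone. This is a minimal sketch: the example sentence and the list of multi-word features are hypothetical, and a real pipeline would need a much larger dictionary of phrases.

```r
txt <- "The United Kingdom set out its position in New York."
multiwords <- c("United Kingdom", "New York", "set out")

# Replace each multi-word feature with its underscore-joined form
# so that a whitespace tokenizer keeps it as a single token
for (mw in multiwords) {
  txt <- gsub(mw, gsub(" ", "_", mw, fixed = TRUE), txt, fixed = TRUE)
}
txt
# "The United_Kingdom set_out its position in New_York."
```

This works, but it requires knowing the multi-word features in advance, which is exactly what the postprocessing approach below avoids.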

For example, we can extract sequences of capitalized words to find multi-word names. Here, sequences() extracts all contiguous collocations of capitalized words (matched by the regular expression ^[A-Z]) in the Guardian corpus and tests their statistical significance.

library(quanteda)

# Subset the Guardian corpus to 2015 articles and tokenize
corp <- corpus_subset(data_corpus_guardian, year == 2015)
toks <- tokens(corp)
# Remove stopwords, but keep padding so that sequences are not
# formed across the positions of removed tokens
toks <- tokens_remove(toks, stopwords('english'), valuetype = 'fixed', padding = TRUE)
# Extract statistically associated sequences of capitalized words
seqs_cap <- sequences(toks, '^[A-Z]', valuetype = 'regex', case_insensitive = FALSE)

The function discovers 94,009 sequences. The top 20 features are names of public figures, places or institutions:

> head(seqs_cap, 20)
             sequence   lambda        sigma count length         z p
1       David Cameron 15.36614 0.0003056257  7227      2 50277.652 0
2            New York 13.83117 0.0006466388  4939      2 21389.333 0
3     David Cameron's 13.49693 0.0007776048  1163      2 17357.053 0
4      George Osborne 13.34385 0.0008286560  3773      2 16103.003 0
5         White House 13.19858 0.0008903639  3609      2 14823.799 0
6  Guardian Australia 13.08636 0.0009420314  2890      2 13891.643 0
7         Tony Abbott 12.97257 0.0009974100  3003      2 13006.255 0
8      John McDonnell 12.93561 0.0010292881   630      2 12567.528 0
9      Downing Street 12.89980 0.0010351439  2273      2 12461.841 0
10         John Kerry 12.89476 0.0010503205   610      2 12276.973 0
11         John Lewis 12.72067 0.0011469714   484      2 11090.659 0
12        Wall Street 12.43097 0.0013087624  1379      2  9498.266 0
13      Jeremy Corbyn 12.40280 0.0013240339  2998      2  9367.434 0
14      Islamic State 12.37286 0.0013405189  3746      2  9229.905 0
15       Peter Dutton 12.29021 0.0014115279   617      2  8707.028 0
16          Labour MP 12.18272 0.0014765743   735      2  8250.666 0
17      United States 12.17668 0.0014785329  3528      2  8235.651 0
18     European Union 12.13947 0.0015081139  1687      2  8049.442 0
19        New Zealand 12.11735 0.0015205357  1296      2  7969.132 0
20         Labour MPs 12.06252 0.0015683187  1492      2  7691.372 0

The top features all consist of two words, but there are also sequences longer than two words:

> head(seqs_cap[seqs_cap$length > 2,], 20)
                            sequence    lambda       sigma count length         z p
236                   New York Times 11.967518 0.004219489  1024      3 2836.2483 0
299                    New York City 11.779066 0.004602973   737      3 2559.0126 0
375                  New South Wales 11.470771 0.005012710   885      3 2288.3376 0
637               Human Rights Watch 12.248531 0.007284579   484      3 1681.4331 0
749            European Central Bank 11.252770 0.007277545  1153      3 1546.2315 0
954          Human Rights Commission 11.351519 0.008337609   335      3 1361.4838 0
971           Small Business Network 11.033481 0.008178077   587      3 1349.1533 0
1839     International Monetary Fund 10.225388 0.010905306   950      3  937.6525 0
1991                Human Rights Act 10.247406 0.011462433   164      3  893.9992 0
2172        National Security Agency  9.660554 0.011392991   243      3  847.9383 0
2240              Black Lives Matter 11.001923 0.013261272   364      3  829.6281 0
2558           Public Health England 10.214219 0.013236491   274      3  771.6712 0
2570              US Federal Reserve 10.205464 0.013257373   394      3  769.7954 0
2577     British Medical Association  9.422186 0.012263582   222      3  768.3062 0
2714          President Barack Obama 10.259552 0.013771345   308      3  744.9927 0
2767                Sir John Chilcot  9.372040 0.012795778   118      3  732.4322 0
2903 Guardian Small Business Network 13.556968 0.019098204   563      4  709.8557 0
2944             Wall Street Journal  9.910653 0.014089506   469      3  703.4067 0
2954       World Health Organisation  9.733951 0.013877557   409      3  701.4168 0
3330         Small Business Showcase 10.313611 0.015764044   176      3  654.2490 0

If you want to keep multi-word features as single tokens, you can concatenate their elements with tokens_compound(). Here I select only sequences that appear more than 10 times in the corpus (p-values are not a good selection criterion in a large dataset).

seqs_cap_sub <- seqs_cap[seqs_cap$count > 10,]
toks2 <- tokens_compound(toks, seqs_cap_sub, valuetype = 'fixed', case_insensitive = FALSE)
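After compounding, the multi-word names survive as single features in a document-feature matrix. A possible inspection step (not from the original post) might look like this; tokens_compound() joins the elements with an underscore by default, so names such as David Cameron appear as David_Cameron:

```r
# Build a document-feature matrix from the compounded tokens and
# look for the underscore-joined multi-word features
dfmt <- dfm(toks2)
head(featnames(dfmt)[grepl("_", featnames(dfmt))], 10)
```

From here the compounded features can be used like any other token, for example in dfm_select() or topfeatures().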