Multi-word verbs (e.g. “set out”, “agree on” and “take off”) and names (e.g. “United Kingdom” and “New York”) are important features of texts, but they are often lost in bag-of-words analysis, because tokenizers usually split strings on spaces. You can preprocess texts to concatenate multi-word features with underscores, as in “set_out” or “United_Kingdom”, but you can also postprocess tokens using the functions that we recently added to quanteda.
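The underscore preprocessing can be done with plain string substitution before tokenization. A minimal sketch in base R, where the phrase list is a hypothetical example of my own (not part of quanteda):

```r
# Hypothetical phrase list for illustration only
txt <- "The United Kingdom set out plans with New York officials."
phrases <- c("United Kingdom", "New York", "set out")
for (p in phrases) {
  # replace each multi-word phrase with its underscored form
  txt <- gsub(p, gsub(" ", "_", p, fixed = TRUE), txt, fixed = TRUE)
}
txt
# "The United_Kingdom set_out plans with New_York officials."
```

The obvious drawback is that you must know the phrase list in advance, which is exactly what the postprocessing approach below avoids.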
For example, we can extract sequences of capitalized words to find multi-word names. Here, sequences() extracts all contiguous collocations of capitalized words (specified by the regular expression ^[A-Z]) in the Guardian corpus and tests their statistical significance.
cops <- corpus_subset(data_corpus_guardian, year == 2015)
toks <- tokens(cops)
toks <- tokens_remove(toks, stopwords('english'), valuetype = 'fixed', padding = TRUE)
seqs_cap <- sequences(toks, '^[A-Z]', valuetype = 'regex', case_insensitive = FALSE)
The number of sequences discovered by the function is 94009. The top 20 features are names of public figures, places or institutions:
> head(seqs_cap, 20)
             sequence   lambda        sigma count length         z p
1       David Cameron 15.36614 0.0003056257  7227      2 50277.652 0
2            New York 13.83117 0.0006466388  4939      2 21389.333 0
3     David Cameron's 13.49693 0.0007776048  1163      2 17357.053 0
4      George Osborne 13.34385 0.0008286560  3773      2 16103.003 0
5         White House 13.19858 0.0008903639  3609      2 14823.799 0
6  Guardian Australia 13.08636 0.0009420314  2890      2 13891.643 0
7         Tony Abbott 12.97257 0.0009974100  3003      2 13006.255 0
8      John McDonnell 12.93561 0.0010292881   630      2 12567.528 0
9      Downing Street 12.89980 0.0010351439  2273      2 12461.841 0
10         John Kerry 12.89476 0.0010503205   610      2 12276.973 0
11         John Lewis 12.72067 0.0011469714   484      2 11090.659 0
12        Wall Street 12.43097 0.0013087624  1379      2  9498.266 0
13      Jeremy Corbyn 12.40280 0.0013240339  2998      2  9367.434 0
14      Islamic State 12.37286 0.0013405189  3746      2  9229.905 0
15       Peter Dutton 12.29021 0.0014115279   617      2  8707.028 0
16          Labour MP 12.18272 0.0014765743   735      2  8250.666 0
17      United States 12.17668 0.0014785329  3528      2  8235.651 0
18     European Union 12.13947 0.0015081139  1687      2  8049.442 0
19        New Zealand 12.11735 0.0015205357  1296      2  7969.132 0
20         Labour MPs 12.06252 0.0015683187  1492      2  7691.372 0
The top features consist of two words, but there are also sequences longer than two words:
> head(seqs_cap[seqs_cap$length > 2,], 20)
                            sequence    lambda       sigma count length         z p
236                   New York Times 11.967518 0.004219489  1024      3 2836.2483 0
299                    New York City 11.779066 0.004602973   737      3 2559.0126 0
375                  New South Wales 11.470771 0.005012710   885      3 2288.3376 0
637               Human Rights Watch 12.248531 0.007284579   484      3 1681.4331 0
749            European Central Bank 11.252770 0.007277545  1153      3 1546.2315 0
954          Human Rights Commission 11.351519 0.008337609   335      3 1361.4838 0
971           Small Business Network 11.033481 0.008178077   587      3 1349.1533 0
1839     International Monetary Fund 10.225388 0.010905306   950      3  937.6525 0
1991                Human Rights Act 10.247406 0.011462433   164      3  893.9992 0
2172        National Security Agency  9.660554 0.011392991   243      3  847.9383 0
2240               Black Lives Matter 11.001923 0.013261272  364      3  829.6281 0
2558            Public Health England 10.214219 0.013236491  274      3  771.6712 0
2570               US Federal Reserve 10.205464 0.013257373  394      3  769.7954 0
2577      British Medical Association  9.422186 0.012263582  222      3  768.3062 0
2714           President Barack Obama 10.259552 0.013771345  308      3  744.9927 0
2767                 Sir John Chilcot  9.372040 0.012795778  118      3  732.4322 0
2903  Guardian Small Business Network 13.556968 0.019098204  563      4  709.8557 0
2944              Wall Street Journal  9.910653 0.014089506  469      3  703.4067 0
2954        World Health Organisation  9.733951 0.013877557  409      3  701.4168 0
3330          Small Business Showcase 10.313611 0.015764044  176      3  654.2490 0
If you want to keep multi-word features as single elements, you can concatenate them with tokens_compound(). Here I select only sequences that appear more than 10 times in the corpus (p-values are not a good selection criterion in a large dataset).
seqs_cap_sub <- seqs_cap[seqs_cap$count > 10,]
toks2 <- tokens_compound(toks, seqs_cap_sub, valuetype = 'fixed', case_insensitive = FALSE)
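To see the effect of compounding, one way is to build a document-feature matrix from the compounded tokens, where the joined names now count as single features. A sketch, assuming toks2 from the step above is available:

```r
# Build a dfm from the compounded tokens; by default tokens_compound()
# joins words with "_", so e.g. "David_Cameron" is now one feature
dfmat <- dfm(toks2)
dfmat <- dfm_remove(dfmat, "")  # drop the pads left by tokens_remove(padding = TRUE)
topfeatures(dfmat, 20)
```

From here the compounded features behave like any other token in frequency analysis or model fitting.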