Geographical dictionary making technique


My new draft paper, ‘Newsmap: Dictionary expansion technique for geographical classification of very short longitudinal texts’, explains how to create a large geographical dictionary for text classification. Its algorithm is an updated version of the one behind the International Newsmap, and it is simpler and more statistically grounded. As I argue in the paper, this technique could be used to classify not only news stories but also social media posts.

Terrorism Dictionary 2014


After seeing the mass media’s strong response to the extremists’ attack on Charlie Hebdo, I started thinking about what I could do on this increasingly important topic. One simple contribution is a dictionary of keywords related to terrorism, so I created the Terrorism Dictionary 2014. The dictionary was made from newswire stories published by the Associated Press and Agence France-Presse in 2014, using the collocation-of-collocation technique.

The top 30 keywords below include a lot of nasty words, although there is also a proper noun as a result of a failure of named entity recognition.

terrorist          854.315998
terrorism          588.953735
terror             481.302104
terrorists         218.965274
attacks            210.076406
group              166.118157
groups             149.240685
militant           119.637842
murder             116.232324
criminal           110.881245
charges            110.290402
extremist          103.736008
organisation       98.508253
jihadist           97.517677
jihadists          91.178175
al-qaida           86.847945
threat             83.710184
acts               82.083663
organization       82.043942
strikes            81.820983
militants          81.399320
violence           78.733125
violent            73.914122
terrorism-related  73.734030
guilty             73.433767
attack             69.712507
fight              66.956416
extremists         64.436517
links              63.727690
charged            63.267324

This list of keywords can be used to find news stories or Twitter posts about terrorism. For example, if an item contains more than three of the top 100 keywords in the dictionary, it is very likely to be about terrorism.
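The rule above can be sketched in base R. This is only an illustration: the helper count_keywords() is a hypothetical name, the example texts are invented, and only ten of the listed keywords are used in place of the top 100.

```r
# Hypothetical helper: count how many distinct dictionary keywords occur in a text
count_keywords <- function(text, keywords) {
  # Lower-case and split on anything that is not a letter, digit, or hyphen
  tokens <- unlist(strsplit(tolower(text), "[^a-z0-9-]+"))
  sum(keywords %in% tokens)
}

# A handful of entries from the list above; the full rule would use the top 100
keywords <- c("terrorist", "terrorism", "terror", "terrorists", "attacks",
              "militant", "extremist", "jihadist", "al-qaida", "threat")

# Invented example texts, for illustration only
texts <- c(
  "Police said the militant group behind the attacks posed a terrorist threat",
  "The central bank left interest rates unchanged on Thursday"
)

hits <- sapply(texts, count_keywords, keywords = keywords, USE.NAMES = FALSE)
is_terrorism <- hits > 3  # "more than three" keywords
```

Counting distinct keywords rather than total occurrences keeps a single repeated word from triggering the rule on its own.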

Left-right policy position dictionary


Latent Semantic Scaling (LSS) works well not only with positive-negative sentiment but also with left-right positions on economic policy. The seed words for this dimension are {deficit, austerity, unstable, recession, inflation, currency, workforce} for the right and {poor, poverty, free, benefits, prices, money, workers} for the left.
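To give a rough idea of how seed words can turn into scores for other words, here is a toy sketch in base R. It is not the system used to build the dictionary. It assumes that words are placed in a low-dimensional space by SVD, as in Latent Semantic Analysis, and that a word's score is the polarity-weighted mean of its cosine similarity to the seeds. The count matrix and the words ‘cuts’ and ‘welfare’ are invented, and only two seeds per side are used.

```r
# Toy sentence-by-word count matrix (invented for illustration only)
words <- c("deficit", "austerity", "poverty", "benefits", "cuts", "welfare")
m <- matrix(c(2, 1, 0, 0, 1, 0,
              1, 2, 0, 0, 2, 0,
              0, 0, 3, 1, 0, 1,
              0, 0, 1, 2, 0, 2),
            nrow = 4, byrow = TRUE, dimnames = list(NULL, words))

# Reduce to k latent dimensions with SVD, as in Latent Semantic Analysis
k <- 2
s <- svd(m, nu = 0, nv = k)
vec <- s$v %*% diag(s$d[1:k])  # one row of latent coordinates per word
rownames(vec) <- words

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Seed words with polarity: +1 for the right, -1 for the left
seeds <- c(deficit = 1, austerity = 1, poverty = -1, benefits = -1)

# Assumed scoring rule: polarity-weighted mean cosine similarity to the seeds
score <- sapply(words, function(w) {
  mean(sapply(names(seeds), function(seed) {
    seeds[[seed]] * cosine(vec[w, ], vec[seed, ])
  }))
})
```

In this toy corpus ‘cuts’ only appears alongside the right seeds and ‘welfare’ only alongside the left seeds, so they end up with positive and negative scores respectively.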

The left-right policy position dictionary was created from a corpus of UK and Irish news from 1996-1997. The first chart is a replication of the Wordscores paper by Benoit and Laver; black and red letters represent Irish and UK parties, respectively.

UK and IE 1997 manifestos

The second chart is the result of machine coding of UK party manifestos from 1987 to 2010 with the same dictionary, and it shows a clear separation of the leftist and rightist parties until 2005. Why is there no difference between the three parties in 2010? Arguably, it is because, from the perspective of 1990s politics, their economic policies became very similar after the economic crisis.

UK party manifestos 1987-2010

Immigration dictionary


This is probably the final version of my immigration dictionary. This text analysis dictionary was created from a British newspaper corpus using a technique called Latent Semantic Scaling, which is based on Latent Semantic Analysis.

The result of the automated content analysis with this dictionary corresponds strongly to manual coding by Amazon Mechanical Turk workers, as you can see in the chart (whiskers represent 95% confidence intervals). Please note, however, that the documents coded by the dictionary are only the sentences about immigration in the party manifestos, selected by keywords (‘immigra*’, ‘migra*’, ‘refugee*’, ‘asylum*’, ‘foreign*’).

UK 2010 manifestos on immigration
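The keyword-based selection of sentences can be sketched in base R as follows. This is a minimal illustration rather than the code used for the chart: the helper mentions_immigration() is a hypothetical name and the sentences are invented. It assumes the wildcard patterns are matched against single lower-cased words.

```r
# Keyword patterns from the post, written as glob-style wildcards
patterns <- c("immigra*", "migra*", "refugee*", "asylum*", "foreign*")

# glob2rx() from the utils package turns each wildcard into an anchored regex
regexes <- utils::glob2rx(patterns)

# Hypothetical helper: TRUE if any word in the sentence matches any pattern
mentions_immigration <- function(sentence) {
  tokens <- unlist(strsplit(tolower(sentence), "[^a-z]+"))
  any(sapply(regexes, function(rx) any(grepl(rx, tokens))))
}

# Invented sentences, for illustration only
sentences <- c(
  "We will cap the number of immigrants admitted each year.",
  "We will invest in high-speed rail.",
  "Asylum claims will be processed faster."
)
keep <- sapply(sentences, mentions_immigration, USE.NAMES = FALSE)
selected <- sentences[keep]
```

Only the first and third sentences are kept here; the dictionary would then be applied to the selected sentences alone.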

The dictionary is made up of 750 entry words. The following are the 30 most positive and the 30 most negative words in the dictionary. Many of them are intuitively positive or negative, but some are not. For example, ‘globalisation’ is positive only in the context of immigration. This is why texts are restricted to sentences on this subject. We can spot words like ‘species’ and ‘wildebeest’ because the newspaper corpus contains stories about animal migration, but this is not too harmful.

# Positive words

1   skills            100
2   globalisation     88.24
3   chauffeured       86.93
4   airport           86.68
5   ranging           82.41
6   clearance         79.48
7   status            78.4
8   agency            74.98
9   issues            72.15
10  breed             69.45
11  claimed           68.84
12  vehemently        68.6
13  skill             67.3
14  test              65.91
15  attract           64.39
16  permanent         63.68
17  legal             59.23
18  melting-pot       57.34
19  species           57.27
20  wildebeest        56.96
21  overstaying       56.07
22  documents         55.9
23  routes            55.75
24  work              55.63
25  shambles          55.28
26  breeding          53.65
27  bringing          53.24
28  employ            52.76
29  passport          52.24
30  official          51.88

# Negative words

1   xenophobia        -141.27
2   control           -130.09
3   racist            -125.2
4   stemming          -122.5
5   tide              -122.46
6   working-class     -115.53
7   negative          -113.76
8   failure           -110.32
9   problems          -106.95
10  influx            -100.81
11  branded           -99.42
12  caused            -96.82
13  exploit           -94.11
14  first-generation  -90.78
15  warned            -89.93
16  families          -88.51
17  soaring           -86.53
18  ignored           -86.45
19  housed            -85.33
20  magnet            -84.47
21  borders           -83.18
22  newly-arrived     -83.12
23  accused           -82.89
24  evicted           -82.02
25  trickle           -81.42
26  rates             -79.42
27  fuelled           -78.34
28  flooded           -76.69
29  non-white         -76.48
30  lorries           -76.38

Text analysis dictionary on psychology


My automated dictionary creation project is making good progress, and I have created a psychology dictionary from a large corpus of UK news on psychology from 1990 to 2011. The score given to each entry word is interpreted as the strength of its association with psychology, and the list can be truncated based on the scores.

The words are extracted using a technique that I call collocation-of-collocation. In this technique, a pattern ‘psycholog*’, which matches ‘psychology’, ‘psychologist’, ‘psychological’, and ‘psychologically’, is given to the system, and it finds collocations of those words. Then, those collocations are used to extract words that are semantically close to psychology. This technique is meant to overcome a limitation of collocation analysis in synonym extraction: words that have the same meaning do not co-occur.
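The two steps can be illustrated with a toy sketch in base R. It is not the actual system: the helper collocates() is a hypothetical name, the documents are invented and assumed to be already tokenised with stop words removed, and plain co-occurrence counts within a two-word window stand in for whatever association measure the real system uses.

```r
# Hypothetical helper: count words within `window` tokens of any target word
collocates <- function(docs, is_target, window = 2) {
  found <- character(0)
  for (tokens in docs) {
    for (i in which(is_target(tokens))) {
      idx <- setdiff(max(1, i - window):min(length(tokens), i + window), i)
      found <- c(found, tokens[idx])
    }
  }
  found <- found[!is_target(found)]  # drop the target words themselves
  sort(table(found), decreasing = TRUE)
}

# Invented, pre-tokenised documents (stop words already removed)
docs <- strsplit(c(
  "psychologist studied child behaviour",
  "psychological research behaviour cognition",
  "therapist studied behaviour patients",
  "psychology cognition research",
  "minister announced budget cuts"
), " ")

# Step 1: collocations of words matching the pattern 'psycholog*'
is_psych <- function(x) grepl("^psycholog", x)
first <- collocates(docs, is_psych)

# Step 2: collocations of those collocations, minus the pattern words
is_first <- function(x) x %in% names(first)
second <- collocates(docs, is_first)
second <- second[!is_psych(names(second))]
```

In this toy corpus ‘therapist’ never appears next to a ‘psycholog*’ word, so step 1 misses it, but step 2 finds it through the shared collocates ‘studied’ and ‘behaviour’.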

Testing immigration dictionary


After making some changes to my automated dictionary creation system, I ran a test to validate the word choice for the new immigration dictionary. The latest version contains fewer intuitively negative words with positive scores than the original version.

The test was performed by comparing the automated content analysis with human coding of the 2010 UK manifestos. X is the automated coding by the dictionary and Y is the human coding. The Greens and the Conservatives are off the 45-degree line, but the automated coding still corresponds strongly to the human coding.

UK 2010

Text analysis dictionary on immigration policy


Dictionary-based text analysis has a number of good properties, but making a new dictionary is always difficult, so text analysts often use existing dictionaries, such as the General Inquirer dictionaries, which were originally created decades ago, or their derivatives. However, I believe that it is time to create new dictionaries from scratch using the many tools and techniques now available to us.

My first original dictionary is the UK Immigration Dictionary. It is meant to measure attitudes toward immigration to the UK. The dictionary contains counterintuitive positive entries such as ‘racist’, but the result is as follows when it is applied to the 2010 UK party manifestos.

BNP          -0.660772785
Coalition     0.403547905
Conservative  0.002508397
Greens       -0.898075732
Labour        0.081029432
LibDem        0.050535076
PC           -0.015306746
SNP          -0.551027977
UKIP         -0.335952325

I am not yet sure how accurate this is, but it looks interesting since small parties, which tend to be against immigration, are all negative.

It is very easy to use the dictionary in R with quanteda:

library(quanteda) # Function and data names below are those of the quanteda version used for this post

# Read the tab-separated dictionary: column 1 is the word, column 2 is its score
df.temp <- read.csv(file="news.dictionary.tfidf.500.csv", header=FALSE, sep='\t')
df.dict <- data.frame(word=as.character(df.temp$V1), score=as.numeric(df.temp$V2))

uk2010immigCorpus <- corpus(uk2010immig,
                            notes="Immigration-related sections of 2010 UK party manifestos")
mx <- tfidf(dfm(uk2010immigCorpus))
mx2 <- mx[, colnames(mx) %in% df.dict$word] # Remove columns not in the dictionary

# Make a list in the same order as the columns
v.dict <- list()
for(word in colnames(mx2)){
  v.dict[[word]] <- df.dict$score[df.dict$word==word]
  #v.dict[[word]] <- ifelse(df.dict$score[df.dict$word==word] > 0, 1, -1)
}

# Score of each manifesto: tf-idf weights multiplied by the dictionary scores and summed
print(as.matrix(mx2) %*% as.matrix(unlist(v.dict)))