New paper on Russia’s international propaganda during the Ukraine crisis


My paper on Russia’s international propaganda during the Ukraine crisis, The spread of the Kremlin’s narratives by a western news agency during the Ukraine crisis, has been published in the Journal of International Communication. This is very timely, because people are talking about the spread of “fake news”!

The description of the Ukraine crisis as an ‘information war’ in recently published studies seems to suggest a belief that the Russian government’s propaganda contributed to Russia’s swift annexation of Crimea. However, studies focusing on Russia’s state-controlled media fail to explain how Russia’s narratives spread beyond the ‘Slavic world’. This study, based on quantitative and qualitative analyses of news coverage by ITAR-TASS, Reuters, the AP, and AFP over two years, reveals that Russia’s narratives were circulated internationally in news stories published by a western news agency. Although this by no means suggests that the western news agency was complicit in Russia’s propaganda effort, these news stories were published on the most popular online news sites, such as Yahoo News and the Huffington Post. These findings highlight the vulnerability of today’s global news-gathering and distribution systems, and the rapid changes in relationships between states and corporations in the media and communications industry.

Handling multi-word features in R


Multi-word verbs (e.g. “set out”, “agree on” and “take off”) and names (e.g. “United Kingdom” and “New York”) are very important features of texts, but it is often difficult to keep them in bag-of-words text analysis, because tokenizers usually break up strings at spaces. You can preprocess texts to concatenate multi-word features with underscores, like “set_out” or “United_Kingdom”, but we can also postprocess tokens using the functions that we recently added to quanteda.

For example, we can extract sequences of capitalized words to find multi-word names. Here, sequences() extracts all contiguous collocations of capitalized words (specified by the regular expression ^[A-Z]) in the Guardian corpus and tests their statistical significance.

cops <- corpus_subset(data_corpus_guardian, year == 2015)  # Guardian articles from 2015
toks <- tokens(cops)
toks <- tokens_remove(toks, stopwords('english'), valuetype = 'fixed', padding = TRUE)  # drop stopwords but keep gaps
seqs_cap <- sequences(toks, '^[A-Z]', valuetype = 'regex', case_insensitive = FALSE)  # collocations of capitalized words

The number of sequences discovered by the function is 94009. The top 20 features are names of public figures, places or institutions:

> head(seqs_cap, 20)
             sequence   lambda        sigma count length         z p
1       David Cameron 15.36614 0.0003056257  7227      2 50277.652 0
2            New York 13.83117 0.0006466388  4939      2 21389.333 0
3     David Cameron's 13.49693 0.0007776048  1163      2 17357.053 0
4      George Osborne 13.34385 0.0008286560  3773      2 16103.003 0
5         White House 13.19858 0.0008903639  3609      2 14823.799 0
6  Guardian Australia 13.08636 0.0009420314  2890      2 13891.643 0
7         Tony Abbott 12.97257 0.0009974100  3003      2 13006.255 0
8      John McDonnell 12.93561 0.0010292881   630      2 12567.528 0
9      Downing Street 12.89980 0.0010351439  2273      2 12461.841 0
10         John Kerry 12.89476 0.0010503205   610      2 12276.973 0
11         John Lewis 12.72067 0.0011469714   484      2 11090.659 0
12        Wall Street 12.43097 0.0013087624  1379      2  9498.266 0
13      Jeremy Corbyn 12.40280 0.0013240339  2998      2  9367.434 0
14      Islamic State 12.37286 0.0013405189  3746      2  9229.905 0
15       Peter Dutton 12.29021 0.0014115279   617      2  8707.028 0
16          Labour MP 12.18272 0.0014765743   735      2  8250.666 0
17      United States 12.17668 0.0014785329  3528      2  8235.651 0
18     European Union 12.13947 0.0015081139  1687      2  8049.442 0
19        New Zealand 12.11735 0.0015205357  1296      2  7969.132 0
20         Labour MPs 12.06252 0.0015683187  1492      2  7691.372 0

The top features consist of two words, but there are also sequences longer than two words:

> head(seqs_cap[seqs_cap$length > 2,], 20)
                            sequence    lambda       sigma count length         z p
236                   New York Times 11.967518 0.004219489  1024      3 2836.2483 0
299                    New York City 11.779066 0.004602973   737      3 2559.0126 0
375                  New South Wales 11.470771 0.005012710   885      3 2288.3376 0
637               Human Rights Watch 12.248531 0.007284579   484      3 1681.4331 0
749            European Central Bank 11.252770 0.007277545  1153      3 1546.2315 0
954          Human Rights Commission 11.351519 0.008337609   335      3 1361.4838 0
971           Small Business Network 11.033481 0.008178077   587      3 1349.1533 0
1839     International Monetary Fund 10.225388 0.010905306   950      3  937.6525 0
1991                Human Rights Act 10.247406 0.011462433   164      3  893.9992 0
2172        National Security Agency  9.660554 0.011392991   243      3  847.9383 0
2240              Black Lives Matter 11.001923 0.013261272   364      3  829.6281 0
2558           Public Health England 10.214219 0.013236491   274      3  771.6712 0
2570              US Federal Reserve 10.205464 0.013257373   394      3  769.7954 0
2577     British Medical Association  9.422186 0.012263582   222      3  768.3062 0
2714          President Barack Obama 10.259552 0.013771345   308      3  744.9927 0
2767                Sir John Chilcot  9.372040 0.012795778   118      3  732.4322 0
2903 Guardian Small Business Network 13.556968 0.019098204   563      4  709.8557 0
2944             Wall Street Journal  9.910653 0.014089506   469      3  703.4067 0
2954       World Health Organisation  9.733951 0.013877557   409      3  701.4168 0
3330         Small Business Showcase 10.313611 0.015764044   176      3  654.2490 0

If you want to keep multi-word features as single tokens, you can concatenate their elements with tokens_compound(). Here I select only the sequences that appear more than 10 times in the corpus (p-values are not a good selection criterion in a large dataset).

seqs_cap_sub <- seqs_cap[seqs_cap$count > 10,]  # keep sequences that occur more than 10 times
toks2 <- tokens_compound(toks, seqs_cap_sub, valuetype = 'fixed', case_insensitive = FALSE)  # join their elements with "_"
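
From here, the compounded tokens can be turned into a document-feature matrix so that multi-word names survive as single features. This is a minimal sketch of my own, not part of the original post (dfmt is an illustrative name):

dfmt <- dfm(toks2)    # multi-word names now appear as single features such as "David_Cameron"
topfeatures(dfmt, 20)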

Segmentation of Japanese or Chinese texts by stringi


Use of a POS tagger such as Mecab or Chasen is considered necessary for the segmentation of Japanese texts, because words are not separated by spaces as in European languages, but I recently learned that this is not always the case. When I was testing quanteda’s tokenization function, I passed a Japanese text to it without much expectation, but the result was very interesting: the Japanese words were segmented very nicely! The output was as accurate as one from Mecab.

> txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"
> quanteda::tokens(txt_jp)
tokens from 1 document.
Component 1 :
 [1] "政治"         "と"           "は"           "社会"         "に対して"     "全体"         "的"           "な"          
 [9] "影響"         "を"           "及"           "ぼ"           "し"           "、"           "社会"         "で"          
[17] "生きる"       "ひとりひとり" "の"           "人"           "の"           "人生"         "に"           "も"          
[25] "様々"         "な"           "影響"         "を"           "及ぼす"       "複雑"         "な"           "領域"        
[33] "で"           "ある"         "。"         

Also, quanteda’s tokenizer segmented Chinese texts very well, even though it should not be able to, because there is no POS tagger in the package.

> txt_cn <- "政治是各种团體进行集体决策的一个过程,也是各种团體或个人为了各自的領域所结成的特定关系,尤指對於某一政治實體的統治,例如統治一個國家,亦指對於一國內外事務之監督與管制。"
> quanteda::tokens(txt_cn)
tokens from 1 document.
Component 1 :
 [1] "政治"   "是"     "各种"   "团"     "體"     "进行"   "集体"   "决策"   "的"     "一个"   "过程"   ","     "也是"  
[14] "各种"   "团"     "體"     "或"     "个人"   "为了"   "各自"   "的"     "領域"   "所"     "结成"   "的"     "特定"  
[27] "关系"   ","     "尤"     "指"     "對於"   "某一"   "政治"   "實體"   "的"     "統治"   ","     "例如"   "統治"  
[40] "一個"   "國家"   ","     "亦"     "指"     "對於"   "一"     "國內外" "事務"   "之"     "監督"   "與"     "管制"  
[53] "。"  

The answer to this mystery was found in stringi::stri_split_boundaries, the underlying function of quanteda’s tokenizer. stri_split_boundaries utilizes a library called ICU (International Components for Unicode), which uses dictionaries for the segmentation of texts in Chinese, Japanese, Thai and Khmer. The Japanese dictionary is actually an IPA dictionary, on which Mecab also depends.
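
If you want to call the underlying function directly, a minimal example looks like this (my own illustration, reusing the txt_jp object defined above):

library(stringi)
stri_split_boundaries(txt_jp, type = 'word')  # ICU's dictionary-based word segmentation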

This means that those who perform bag-of-words text analysis of Chinese, Japanese, Thai or Khmer texts no longer need to install a POS tagger for word segmentation. This could be a massive boost for social scientific text analysis in those languages!

Word segmentation of Japanese and Chinese texts with stringi


Many people believe that morphological analysis with tools such as Mecab or Chasen is indispensable for segmenting Japanese texts into words, but this is not necessarily the case. I learned this while examining quanteda’s tokenization function: when I passed a Japanese text to it, the text was split into words as cleanly as by Mecab.

> txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"
> quanteda::tokens(txt_jp)
tokens from 1 document.
Component 1 :
 [1] "政治"         "と"           "は"           "社会"         "に対して"     "全体"         "的"           "な"          
 [9] "影響"         "を"           "及"           "ぼ"           "し"           "、"           "社会"         "で"          
[17] "生きる"       "ひとりひとり" "の"           "人"           "の"           "人生"         "に"           "も"          
[25] "様々"         "な"           "影響"         "を"           "及ぼす"       "複雑"         "な"           "領域"        
[33] "で"           "ある"         "。" 

quanteda has no morphological analysis functionality, so it was a surprise that its tokenization function also segmented Chinese texts cleanly.

> txt_cn <- "政治是各种团體进行集体决策的一个过程,也是各种团體或个人为了各自的領域所结成的特定关系,尤指對於某一政治實體的統治,例如統治一個國家,亦指對於一國內外事務之監督與管制。"
> quanteda::tokens(txt_cn)
tokens from 1 document.
Component 1 :
 [1] "政治"   "是"     "各种"   "团"     "體"     "进行"   "集体"   "决策"   "的"     "一个"   "过程"   ","     "也是"  
[14] "各种"   "团"     "體"     "或"     "个人"   "为了"   "各自"   "的"     "領域"   "所"     "结成"   "的"     "特定"  
[27] "关系"   ","     "尤"     "指"     "對於"   "某一"   "政治"   "實體"   "的"     "統治"   ","     "例如"   "統治"  
[40] "一個"   "國家"   ","     "亦"     "指"     "對於"   "一"     "國內外" "事務"   "之"     "監督"   "與"     "管制"  
[53] "。"  

Looking into this further, I found the reason for this mysterious behavior: stringi::stri_split_boundaries, on which the tokenization function is based, uses ICU (International Components for Unicode), which segments Chinese, Japanese, Thai and Khmer texts with dictionaries, and its Japanese dictionary is the IPA dictionary that Mecab also uses.

If stringi can do word segmentation, then morphological analysis is not needed for Japanese or Chinese text analysis that does not rely on syntactic parsing. This lowers the technical barrier, and social scientific analysis of Japanese and Chinese texts should spread from here.

Visualizing media representation of the world


I uploaded an image visualizing foreign news coverage earlier this year, but I found that the image is very difficult to interpret, because both large positive and large negative values are important in SVD. Large positive values can be the result of intense media attention, but what do large negative values mean?

A solution to this problem is the use of non-negative matrix factorization (NMF). As the name of the algorithm suggests, the matrices produced by NMF are restricted to be non-negative, so they are much easier to interpret. I used this technique to visualize media representation of the world in the Guardian, the BBC, and The Irish Times in 2012-2016. I first created a large matrix from co-occurrences of countries in news stories, and then reduced the dimension of the secondary countries (columns) to five.
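
The post does not include the code, but a minimal sketch of the factorization step might look like the following, assuming a non-negative co-occurrence matrix mx (rows are primary countries, columns are secondary countries) and the NMF package; both the object name and the package choice are my assumptions, not the original setup:

library(NMF)
res <- nmf(mx, rank = 5)               # factorize into five latent clusters
w <- basis(res)                        # country-by-cluster matrix, all values non-negative
heatmap(w, Colv = NA, scale = 'none')  # visualize cluster membership of each country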

Rows of the matrices are sorted in order of raw frequency counts. Australia (AU), China (CN), Japan (JP) and India (IN) are highlighted, because I was interested in how the UK was represented by the media in relation to non-EU countries in this analysis.

[Figure: brexit_heatmap_all, heatmaps of country clusters in the Guardian, the BBC and The Irish Times]

I can identify columns that can be labeled ‘home-centric’, ‘European’, ‘Middle East’ and ‘Asia-Pacific’ clusters based on the prominent countries. In the Guardian, the home-centric cluster is G1, because Britain is the single most important country, as the very dark color of the cell shows. The European cluster is G3, because Germany, Greece, Spain, Italy and Ireland have dark cells in the column. The Middle East cluster is G2, in which we find dark cells for Syria (SY), Iran (IR), Afghanistan (AF) and Israel (IL). The Asia-Pacific cluster is G5, where China (CN), Australia (AU), Japan (JP), India (IN), the United States (US) and Canada (CA) are prominent. In the BBC, the home-centric cluster is G1, and the European cluster is G4, where France, Greece, Spain and Germany are found, although the United States also appears there. The Middle East cluster is G2, which includes Syria, Iraq, Egypt (EG) and Israel (IL). The Asia-Pacific cluster is G3, where China, the United States, Australia, India and Japan are found. In the Irish newspaper, the home-centric cluster is G1, the European cluster is G4, the Middle East cluster is G3, and the Asia-Pacific cluster is G5.

More detail is available in Britain in the world: Visual analysis of international relations in the British and the Irish media before the referendum, presented at the 2016 Political Studies Association Conference.

Visualizing foreign news coverage


The challenge in international news research is identifying patterns in foreign news reporting, which covers thousands of events in hundreds of countries, but visualization seems to be useful. This chart summarizes foreign news coverage by the New York Times between 2012 and 2015 with heatmaps, where rows and columns respectively represent the most frequent countries and the most significant events (or groups of events).

For example, bright cells in the rows labeled SY show the newspaper’s coverage of the Syrian civil war. In 2012, the civil war was the fifth most important event (E5), and it became the second most important event (E2) in 2013, although the newspaper shifted its focus to the Ukraine crisis (E2) in 2014, making the Syrian civil war insignificant. In 2015, however, the Syrian civil war became the second most important event again. Further, in that year, Syria also scores highly in E8, the European refugee crisis, along with Greece (GR), Iraq (IQ), and the UK (GB). Germany (DE) is also significant in that event, although the country does not appear in the chart as it is only 12th from the top.

[Figure: heatmap5_nyt, yearly heatmaps of New York Times foreign news coverage by country and event]

This chart was created from news summaries geographically classified by Newsmap. For each of the years, a date-country matrix was created, and its date dimension was reduced by SVD to identify the 10 most important events. Scores for countries were centered and normalized for each event.
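
As a rough illustration of this step (my own sketch, not the original code), assuming a date-by-country matrix mx of story counts:

s <- svd(mx, nu = 10, nv = 10)              # keep the 10 largest components ('events')
events <- s$v                               # country loadings on each event
rownames(events) <- colnames(mx)
events <- scale(events)                     # center and normalize scores for each event
heatmap(events, Colv = NA, scale = 'none')  # rows: countries, columns: events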

Best paper award at ICA methodology pre-conference


I presented my paper on geographical classification at the methodology pre-conference at ICA in Fukuoka, Japan. The pre-conference has historical significance as the first methodology group meeting at a major international conference in media and communication studies. There were a lot of interesting presentations, but, to my great surprise, I won a Best Paper Award from the organizers. According to the judges, I received the award because I had done the most rigorous validation of a new methodology.

Newsmap in R


I have been using Newsmap as one of the key tools in many of my research projects, but I was not able to share it with other people, as it was a complex Python system. To make the tool widely available, I recently implemented Newsmap in R. The R version depends on another text analysis package, quanteda, to which I am also contributing.
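
A minimal sketch of how the R version can be used, based on my reading of the package documentation rather than code from this post (txt stands in for a character vector of news summaries):

library(quanteda)
library(newsmap)
toks <- tokens(txt)
labels <- dfm(tokens_lookup(toks, data_dictionary_newsmap_en, levels = 3))  # seed-dictionary labels
feats <- dfm(toks, tolower = FALSE)  # keep capitalization of proper names
model <- textmodel_newsmap(feats, labels)
head(predict(model))                 # predicted country for each document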

Newsmap is back


International Newsmap has been offline for a while due to a restriction imposed by my server hosting company. The number of news stories in Newsmap had been increasing since 2011, and the company decided to disable the database. I left it offline, lacking the motivation to restore the system, but an email from a student in an international journalism course in Portugal encouraged me to do the job.

I moved the website to a new, more powerful server and upgraded its user interface as well as the system behind it. The media sources have been reduced to 11 newspapers and satellite news channels that I find interesting. The geographical classifier has been updated to the latest version, and training is now based solely on stories collected from Yahoo News.

In the new user interface, there is no button to switch between daily, monthly and yearly views, but if you click the + signs next to the dates on the right-hand side, you can see monthly or daily figures.

Analysis of Russian media


Applying techniques developed with English-language texts to other languages is not so easy, but I managed to adapt my LSS system to the Russian language for a project on Russian media framing of street protests. In the project, I am responsible for the collection and analysis of Russian-language news gathered from state-controlled media between 2011 and 2014. The dictionary that I created measures a dimension of protests as freedom of expression vs. social disorder as well as human coders do. The details of the dictionary construction procedure are available in one of my posts. I will keep posting to the project blog.