Analyzing Asian texts in R on English Windows machines

R is generally good with Unicode, and we do not see garbled text as long as we use the stringi package. But there are some known bugs. The worst is probably the one that has been discussed in the online community.

On Windows, R prints character vectors properly, but not character vectors inside a data.frame:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

> txt <- "あいうえお" # Japanese text
> print(txt)
[1] "あいうえお" # good
> print(data.frame(txt))
                                       txt
1 <U+3042><U+3044><U+3046><U+3048><U+304A> # not good

While Ista Zahn’s interesting post only shows the depth of the problem, there is a solution (or workaround) that you can try:

First, set the language for non-Unicode programs in Windows’ Control Panel > Clock, Language, and Region > Language > Language for non-Unicode programs > Change system locale.

Second, set the locale in the R script:

> Sys.setlocale("LC_CTYPE", locale="Japanese") # set locale
[1] "Japanese_Japan.932"

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=Japanese_Japan.932 # changed          
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

Then, the Japanese text in a data.frame is printed correctly:

> txt <- "あいうえお" # Japanese text
> print(txt)
[1] "あいうえお" # good
> print(data.frame(txt))
         txt
1 あいうえお # good
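
If you want to switch back afterwards, you can reset LC_CTYPE to the original value shown in sessionInfo() above (a small sketch; the exact locale name depends on your system):

Sys.setlocale("LC_CTYPE", locale = "English_United States.1252") # restore the original locale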

R and Python text analysis packages performance comparison

Like many other people, I started text analysis in Python, because R was notoriously slow. Python looked like a perfect language for text analysis, and I did a lot of work during my PhD using gensim along with home-grown tools. I loved gensim’s LSA, which quickly and consistently decomposes very large document-feature matrices.

However, I faced a memory crisis in Python when the size of the data for my projects continued to grow, reaching 500MB. When 500MB of texts were tokenized in Python, they took nearly 10GB of RAM. This problem is deep-rooted in Python’s list object, which is used to store character strings. The only solution seemed to be to convert all the tokens into integers (serialization) in the early stages of text processing, but developing such a platform in Python is a huge undertaking.

After joining the quanteda team last summer, I have spent a lot of time improving its performance with a new architecture. I implemented up-front serialization (the tokens() function) and wrote a number of multi-threaded C++ functions that modify the serialized data (many of the tokens_* functions). Once tokens are serialized, creating a sparse document-feature matrix (dfm) is quick and easy.

The performance gain of quanteda’s new architecture became apparent in a head-to-head comparison with gensim. Quanteda’s execution time is around 50% shorter than gensim’s, and its peak memory consumption is 40% smaller.


The data used for this benchmark is a corpus of 117,942 news stories published by the London Times. The operations are reading the texts from disk, tokenizing, removing stopwords, and constructing a sparse document-feature matrix. Execution time and peak memory consumption are taken from the ‘Elapsed time’ and ‘Maximum resident set size’ reported by the GNU time command. The package versions are 0.9.9.48 (quanteda) and 2.0.0 (gensim).

R

#!/usr/bin/Rscript
require(quanteda)
require(stringi)

cat("Read files\n")
txts <- stri_read_lines('data.txt') # using stri_read_lines because readLines is very slow

cat("Tokenize texts\n")
toks <- tokens(txts, what = "fastestword") # 'fastestword' splits texts by spaces only

cat("Remove stopwords\n")
toks <- tokens_remove(toks, stopwords('english'))

mx <- dfm(toks) # construct a sparse document-feature matrix

cat(nfeature(mx), "unique types\n")

Python

#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import division
import os, sys, string, codecs
from gensim import corpora, models


if __name__ == "__main__":
    
    print "Read files"
    txts = []
    with codecs.open('data.txt', 'r', 'utf-8') as fp:
        for txt in fp:
            if len(txt.strip()) > 0:
                txts.append(txt.strip())
                
    # stopwords are imported from quanteda            
    stopwords = set(["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", 
                     "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", 
                     "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", 
                     "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "would", 
                     "should", "could", "ought", "i'm", "you're", "he's", "she's", "it's", "we're", "they're", "i've", "you've", "we've", 
                     "they've", "i'd", "you'd", "he'd", "she'd", "we'd", "they'd", "i'll", "you'll", "he'll", "she'll", "we'll", "they'll", 
                     "isn't", "aren't", "wasn't", "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't", 
                     "shan't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "let's", "that's", "who's", "what's", "here's", "there's", 
                     "when's", "where's", "why's", "how's", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", 
                     "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", 
                     "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", 
                     "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", 
                     "only", "own", "same", "so", "than", "too", "very"])
    
    print "Tokenize and remove stopwords"
    toks = [[tok for tok in txt.lower().split() if tok not in stopwords] for txt in txts]
    
    print "Serialize tokens"
    dic = corpora.Dictionary(toks)
    
    print "Construct a document-feature matrix"
    mx = [dic.doc2bow(tok) for tok in toks]
    
    print(dic)

Paper on how to measure news bias by quantitative text analysis

My paper titled Measuring news bias: Russia’s official news agency ITAR-TASS’s coverage of the Ukraine crisis has been published in the European Journal of Communication.
In this piece, I used quantitative text analysis techniques to estimate how much the news coverage of the Ukraine crisis by ITAR-TASS was biased by the influence of the Russian government:

Objectivity in news reporting is one of the most widely discussed topics in journalism, and numbers of studies on bias in news have been conducted, but there is little agreement on how to define or measure news bias. Aiming to settle the theoretical and methodological disagreement, the author redefined news bias and applied a new methodology to detect the Russian government’s influence on ITAR-TASS during the Ukraine crisis. A longitudinal content analysis of over 35,000 English-language newswires on the Ukraine crisis published by ITAR-TASS and Interfax clearly showed that ITAR-TASS’s framing of Ukraine was reflecting desirability of pivotal events in the crisis to the Russian government. This result reveals Russia’s strategic use of the state-owned news agency for international propaganda in its ‘hybrid war’, demonstrating the effectiveness of the new approach to news bias.

Newsmap paper in Digital Journalism

My paper on geographical news classification has finally been published in Digital Journalism, a sister journal of Journalism Studies. In this paper, I not only evaluate Newsmap’s classification accuracy but also compare it with other tools such as Open Calais and Geoparser.io.

This paper presents the results of an evaluation of three different types of geographical news classification methods: (1) simple keyword matching, a popular method in media and communications research; (2) geographical information extraction systems equipped with named-entity recognition and place name disambiguation mechanisms (Open Calais and Geoparser.io); and (3) a semi-supervised machine learning classifier developed by the author (Newsmap). Newsmap substitutes manual coding of news stories with dictionary-based labelling in the creation of large training sets to extract large numbers of geographical words without human involvement and it also identifies multi-word names to reduce the ambiguity of the geographical traits fully automatically. The evaluation of classification accuracy of the three types of methods against 5000 human-coded news summaries reveals that Newsmap outperforms the geographical information extraction systems in overall accuracy, while the simple keyword matching suffers from ambiguity of place names in countries with ambiguous place names.

New paper on Russia’s international propaganda during the Ukraine crisis

My paper on Russia’s international propaganda during the Ukraine crisis, The spread of the Kremlin’s narratives by a western news agency during the Ukraine crisis, has been published in the Journal of International Communication. This is very timely, because people are talking about the spread of “fake news”!

The description of the Ukraine crisis as an ‘information war’ in recently published studies seems to suggest a belief that the Russian government’s propaganda in the crisis contributed to Russia’s swift annexation of Crimea. However, studies focusing on Russia’s state-controlled media fail to explain how Russia’s narratives spread beyond the ‘Slavic world’. This study, based on quantitative and qualitative analyses of news coverage by ITAR-TASS, Reuters, the AP, and AFP over two years, reveals that Russia’s narratives were internationally circulated in news stories published by a western news agency. Although this by no means suggests that the western news agency was complicit in Russia’s propaganda effort, these news stories were published on the most popular online news sites, such as Yahoo News and Huffington Post. These findings highlight the vulnerability of today’s global news-gathering and distribution systems, and the rapid changes in relationships between states and corporations in the media and communications industry.

Handling multi-word features in R

Multi-word verbs (e.g. “set out”, “agree on” and “take off”) or names (e.g. “United Kingdom” and “New York”) are very important features of texts, but it is often difficult to keep them in bag-of-words text analysis, because tokenizers usually break up strings by spaces. You can preprocess texts to concatenate multi-word features with underscores, like “set_out” or “United_Kingdom”, but you can also post-process tokenized texts using the functions that we recently added to quanteda.

For example, we can extract sequences of capitalized words to find multi-word names. Here, sequences() extracts all contiguous collocations of capitalized words (specified by the regular expression ^[A-Z]) from the Guardian corpus and tests their statistical significance.

cops <- corpus_subset(data_corpus_guardian, year == 2015)
toks <- tokens(cops)
toks <- tokens_remove(toks, stopwords('english'), valuetype = 'fixed', padding = TRUE)
seqs_cap <- sequences(toks, '^[A-Z]', valuetype = 'regex', case_insensitive = FALSE)

The number of sequences discovered by the function is 94009. The top 20 features are names of public figures, places or institutions:

> head(seqs_cap, 20)
             sequence   lambda        sigma count length         z p
1       David Cameron 15.36614 0.0003056257  7227      2 50277.652 0
2            New York 13.83117 0.0006466388  4939      2 21389.333 0
3     David Cameron's 13.49693 0.0007776048  1163      2 17357.053 0
4      George Osborne 13.34385 0.0008286560  3773      2 16103.003 0
5         White House 13.19858 0.0008903639  3609      2 14823.799 0
6  Guardian Australia 13.08636 0.0009420314  2890      2 13891.643 0
7         Tony Abbott 12.97257 0.0009974100  3003      2 13006.255 0
8      John McDonnell 12.93561 0.0010292881   630      2 12567.528 0
9      Downing Street 12.89980 0.0010351439  2273      2 12461.841 0
10         John Kerry 12.89476 0.0010503205   610      2 12276.973 0
11         John Lewis 12.72067 0.0011469714   484      2 11090.659 0
12        Wall Street 12.43097 0.0013087624  1379      2  9498.266 0
13      Jeremy Corbyn 12.40280 0.0013240339  2998      2  9367.434 0
14      Islamic State 12.37286 0.0013405189  3746      2  9229.905 0
15       Peter Dutton 12.29021 0.0014115279   617      2  8707.028 0
16          Labour MP 12.18272 0.0014765743   735      2  8250.666 0
17      United States 12.17668 0.0014785329  3528      2  8235.651 0
18     European Union 12.13947 0.0015081139  1687      2  8049.442 0
19        New Zealand 12.11735 0.0015205357  1296      2  7969.132 0
20         Labour MPs 12.06252 0.0015683187  1492      2  7691.372 0

The top features consist of two words, but there are also sequences longer than two words:

> head(seqs_cap[seqs_cap$length > 2,], 20)
                            sequence    lambda       sigma count length         z p
236                   New York Times 11.967518 0.004219489  1024      3 2836.2483 0
299                    New York City 11.779066 0.004602973   737      3 2559.0126 0
375                  New South Wales 11.470771 0.005012710   885      3 2288.3376 0
637               Human Rights Watch 12.248531 0.007284579   484      3 1681.4331 0
749            European Central Bank 11.252770 0.007277545  1153      3 1546.2315 0
954          Human Rights Commission 11.351519 0.008337609   335      3 1361.4838 0
971           Small Business Network 11.033481 0.008178077   587      3 1349.1533 0
1839     International Monetary Fund 10.225388 0.010905306   950      3  937.6525 0
1991                Human Rights Act 10.247406 0.011462433   164      3  893.9992 0
2172        National Security Agency  9.660554 0.011392991   243      3  847.9383 0
2240              Black Lives Matter 11.001923 0.013261272   364      3  829.6281 0
2558           Public Health England 10.214219 0.013236491   274      3  771.6712 0
2570              US Federal Reserve 10.205464 0.013257373   394      3  769.7954 0
2577     British Medical Association  9.422186 0.012263582   222      3  768.3062 0
2714          President Barack Obama 10.259552 0.013771345   308      3  744.9927 0
2767                Sir John Chilcot  9.372040 0.012795778   118      3  732.4322 0
2903 Guardian Small Business Network 13.556968 0.019098204   563      4  709.8557 0
2944             Wall Street Journal  9.910653 0.014089506   469      3  703.4067 0
2954       World Health Organisation  9.733951 0.013877557   409      3  701.4168 0
3330         Small Business Showcase 10.313611 0.015764044   176      3  654.2490 0

If you want to keep the elements of multi-word features together, you can concatenate them with tokens_compound(). Here I select only the sequences that appear more than 10 times in the corpus (p-values are not a good selection criterion in a large dataset).

seqs_cap_sub <- seqs_cap[seqs_cap$count > 10,]
toks2 <- tokens_compound(toks, seqs_cap_sub, valuetype = 'fixed', case_insensitive = FALSE)
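
As a quick check (a minimal sketch, assuming the objects created above; featnames() assumes a recent version of quanteda, and colnames() works too), the compounded tokens now contain multi-word features joined by underscores, which survive the construction of a document-feature matrix:

mx2 <- dfm(toks2)                                # document-feature matrix with compounds
head(featnames(mx2)[grepl('_', featnames(mx2))]) # e.g. "David_Cameron", "New_York"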

Segmentation of Japanese or Chinese texts by stringi

Use of a POS tagger such as Mecab or Chasen is considered necessary for the segmentation of Japanese texts, because words are not separated by spaces as they are in European languages, but I recently learned that this is not always the case. When I was testing quanteda’s tokenization function, I passed a Japanese text to it without much expectation, but the result was very interesting: the Japanese words were segmented very nicely! The output was as accurate as one from Mecab.

> txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"
> quanteda::tokens(txt_jp)
tokens from 1 document.
Component 1 :
 [1] "政治"         "と"           "は"           "社会"         "に対して"     "全体"         "的"           "な"          
 [9] "影響"         "を"           "及"           "ぼ"           "し"           "、"           "社会"         "で"          
[17] "生きる"       "ひとりひとり" "の"           "人"           "の"           "人生"         "に"           "も"          
[25] "様々"         "な"           "影響"         "を"           "及ぼす"       "複雑"         "な"           "領域"        
[33] "で"           "ある"         "。"         

Also, quanteda’s tokenizer segmented Chinese texts very well, but it should not be able to work like this, because there is no POS tagger in the package.

> txt_cn <- "政治是各种团體进行集体决策的一个过程,也是各种团體或个人为了各自的領域所结成的特定关系,尤指對於某一政治實體的統治,例如統治一個國家,亦指對於一國內外事務之監督與管制。"
> quanteda::tokens(txt_cn)
tokens from 1 document.
Component 1 :
 [1] "政治"   "是"     "各种"   "团"     "體"     "进行"   "集体"   "决策"   "的"     "一个"   "过程"   ","     "也是"  
[14] "各种"   "团"     "體"     "或"     "个人"   "为了"   "各自"   "的"     "領域"   "所"     "结成"   "的"     "特定"  
[27] "关系"   ","     "尤"     "指"     "對於"   "某一"   "政治"   "實體"   "的"     "統治"   ","     "例如"   "統治"  
[40] "一個"   "國家"   ","     "亦"     "指"     "對於"   "一"     "國內外" "事務"   "之"     "監督"   "與"     "管制"  
[53] "。"  

The answer to this mystery was found in stringi::stri_split_boundaries, which is the underlying function of quanteda’s tokenizer. stri_split_boundaries utilizes a library called ICU (International Components for Unicode), which uses dictionaries for the segmentation of texts in Chinese, Japanese, Thai or Khmer. The Japanese dictionary is actually an IPA dictionary, on which Mecab also depends.
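
You can call stringi directly to see this behaviour (a minimal sketch; type = "word" selects ICU's word-break iterator, which falls back on dictionary-based segmentation for these languages):

# skip_word_none = TRUE drops whitespace and punctuation-only pieces
stringi::stri_split_boundaries(txt_jp, type = "word", skip_word_none = TRUE)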

This means that those who perform bag-of-words text analysis of Chinese, Japanese, Thai or Khmer texts no longer need to install a POS tagger for word segmentation. This would be a massive boost for social scientific text analysis in those languages!

Word segmentation of Japanese and Chinese texts with stringi

Many people think that morphological analysis with tools such as Mecab or Chasen is indispensable for segmenting Japanese texts into words, but that is not necessarily the case. I learned this while examining quanteda’s tokenization function: when I passed a Japanese text to it, the text was split neatly into words, just as Mecab would do.

> txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"
> quanteda::tokens(txt_jp)
tokens from 1 document.
Component 1 :
 [1] "政治"         "と"           "は"           "社会"         "に対して"     "全体"         "的"           "な"          
 [9] "影響"         "を"           "及"           "ぼ"           "し"           "、"           "社会"         "で"          
[17] "生きる"       "ひとりひとり" "の"           "人"           "の"           "人生"         "に"           "も"          
[25] "様々"         "な"           "影響"         "を"           "及ぼす"       "複雑"         "な"           "領域"        
[33] "で"           "ある"         "。" 

quanteda has no morphological analysis capability, so it was a surprise that its tokenization function also segmented Chinese texts neatly.

> txt_cn <- "政治是各种团體进行集体决策的一个过程,也是各种团體或个人为了各自的領域所结成的特定关系,尤指對於某一政治實體的統治,例如統治一個國家,亦指對於一國內外事務之監督與管制。"
> quanteda::tokens(txt_cn)
tokens from 1 document.
Component 1 :
 [1] "政治"   "是"     "各种"   "团"     "體"     "进行"   "集体"   "决策"   "的"     "一个"   "过程"   ","     "也是"  
[14] "各种"   "团"     "體"     "或"     "个人"   "为了"   "各自"   "的"     "領域"   "所"     "结成"   "的"     "特定"  
[27] "关系"   ","     "尤"     "指"     "對於"   "某一"   "政治"   "實體"   "的"     "統治"   ","     "例如"   "統治"  
[40] "一個"   "國家"   ","     "亦"     "指"     "對於"   "一"     "國內外" "事務"   "之"     "監督"   "與"     "管制"  
[53] "。"  

Looking into this further, I found the reason for this puzzling behaviour: stringi::stri_split_boundaries, on which the tokenization function is based, uses ICU (International Components for Unicode), which performs dictionary-based segmentation for Chinese, Japanese, Thai and Khmer, and its Japanese dictionary is the IPA dictionary that is also used by Mecab.

If stringi can do word segmentation, then morphological analysis is not needed for Japanese or Chinese text analysis that does not rely on syntactic parsing. This lowers the technical barrier, and social scientific analysis of Japanese and Chinese texts should spread as a result.

Visualizing media representation of the world

I uploaded an image visualizing foreign news coverage early this year, but I found that the image is very difficult to interpret, because both large positive and large negative values are important in SVD. Large positive values can be the result of intense media attention, but what do large negative values mean?

A solution to this problem is the use of non-negative matrix factorization (NMF). As the name of the algorithm suggests, matrices decomposed by NMF are restricted to be non-negative, so they are much easier to interpret. I used this technique to visualize the media representation of the world by the Guardian, the BBC, and The Irish Times in 2012-2016. I first created a large matrix from the co-occurrences of countries in news stories, and then reduced the dimension of the secondary countries (columns) to five.
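
The sketch below illustrates the idea with the NMF package on CRAN (an assumption about tooling, not necessarily what was used here); mx_co stands for a hypothetical country co-occurrence matrix with primary countries in rows and secondary countries in columns:

library(NMF)                          # non-negative matrix factorization
res <- nmf(mx_co, rank = 5)           # decompose into five non-negative components
w <- basis(res)                       # loadings of the primary countries on the components
heatmap(w, Colv = NA, scale = "none") # heatmap of the loadings, one column per component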

Rows of the matrices are sorted in order of raw frequency counts. Australia (AU), China (CN), Japan (JP) and India (IN) are highlighted, because in this analysis I was interested in how the UK was represented by the media in relation to non-EU countries.

[Figure brexit_heatmap_all: heatmaps of the five NMF components for the Guardian, the BBC and The Irish Times]

I can identify columns that can be labeled ‘home-centric’, ‘European’, ‘the Middle East’ and ‘Asia-Pacific’ clusters based on the prominent countries. In the Guardian, the home-centric cluster is G1, because Britain is the single most important country, as the very dark color of the cell shows. The European cluster is G3, because Germany, Greece, Spain, Italy and Ireland have dark cells in the column. The Middle East cluster is G2, in which we find dark cells for Syria (SY), Iran (IR), Afghanistan (AF) and Israel (IL). The Asia-Pacific cluster is G5, where China (CN), Australia (AU), Japan (JP), India (IN), the United States (US) and Canada (CA) are prominent. In the BBC, the home-centric cluster is G1, and the European cluster is G4, where France, Greece, Spain and Germany are found, although the United States also appears there. The Middle East cluster is G2, which includes Syria, Iraq, Egypt (EG) and Israel (IL). The Asia-Pacific cluster is G3, where China, the United States, Australia, India and Japan are found. In the Irish newspaper, the home-centric cluster is G1, the European cluster is G4, the Middle East cluster is G3, and the Asia-Pacific cluster is G5.

More detail is available in Britain in the world: Visual analysis of international relations in the British and the Irish media before the referendum, presented at the 2016 Political Studies Association Conference.

Visualizing foreign news coverage

The challenge in international news research is identifying patterns in foreign news reporting, which covers thousands of events in hundreds of countries, but visualization seems to be useful for this. This chart summarizes foreign news coverage by the New York Times between 2012 and 2015 with heatmaps, in which the rows and columns respectively represent the most frequent countries and the most significant events (or groups of events).

For example, bright cells in the rows labeled SY show the newspaper’s coverage of the Syrian civil war. In 2012, the civil war was the fifth most important event (E5), and it then became the second most important event (E2) in 2013, although the newspaper shifted its focus to the Ukraine crisis (E2) in 2014, making the Syrian civil war insignificant. In 2015, however, the Syrian civil war became the second most important event again. Further, in that year, Syria also scores high in E8, which is the European refugee crisis, along with Greece (GR), Iraq (IQ), and the UK (GB). Germany (DE) is also significant in that event, although the country does not appear in the chart because it is only 12th from the top.

[Figure heatmap5_nyt: heatmaps of New York Times foreign news coverage, with countries in rows and events (E1-E10) in columns for each year]

This chart was created from news summaries geographically classified by Newsmap. For each year, a date-country matrix was created, and its date dimension was reduced by SVD to identify the 10 most important events. Scores for countries were centered and normalized for each event.
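
A minimal sketch of this procedure in base R (mx_dc stands for a hypothetical date-by-country matrix of classified story counts for one year, with at least 10 countries; it is not the original data):

sv <- svd(mx_dc, nu = 0, nv = 10)     # keep the 10 largest right singular vectors ("events")
scores <- scale(sv$v)                 # center and normalize the country scores for each event
rownames(scores) <- colnames(mx_dc)   # countries
colnames(scores) <- paste0("E", 1:10) # E1 corresponds to the largest singular value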