Segmentation of Japanese or Chinese texts by stringi


Use of a POS tagger such as MeCab or ChaSen is considered necessary for segmentation of Japanese texts, because words are not separated by spaces as they are in European languages, but I recently learned that this is not always the case. When I was testing quanteda's tokenization function, I passed a Japanese text to it without much expectation, but the result was very interesting: the Japanese words were segmented very nicely! The output was as accurate as one from MeCab.

> txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"
> quanteda::tokens(txt_jp)
tokens from 1 document.
Component 1 :
 [1] "政治"         "と"           "は"           "社会"         "に対して"     "全体"         "的"           "な"          
 [9] "影響"         "を"           "及"           "ぼ"           "し"           "、"           "社会"         "で"          
[17] "生きる"       "ひとりひとり" "の"           "人"           "の"           "人生"         "に"           "も"          
[25] "様々"         "な"           "影響"         "を"           "及ぼす"       "複雑"         "な"           "領域"        
[33] "で"           "ある"         "。"         

Also, quanteda's tokenizer segmented Chinese texts very well, even though it should not be able to, because there is no POS tagger in the package.

> txt_cn <- "政治是各种团體进行集体决策的一个过程,也是各种团體或个人为了各自的領域所结成的特定关系,尤指對於某一政治實體的統治,例如統治一個國家,亦指對於一國內外事務之監督與管制。"
> quanteda::tokens(txt_cn)
tokens from 1 document.
Component 1 :
 [1] "政治"   "是"     "各种"   "团"     "體"     "进行"   "集体"   "决策"   "的"     "一个"   "过程"   ","     "也是"  
[14] "各种"   "团"     "體"     "或"     "个人"   "为了"   "各自"   "的"     "領域"   "所"     "结成"   "的"     "特定"  
[27] "关系"   ","     "尤"     "指"     "對於"   "某一"   "政治"   "實體"   "的"     "統治"   ","     "例如"   "統治"  
[40] "一個"   "國家"   ","     "亦"     "指"     "對於"   "一"     "國內外" "事務"   "之"     "監督"   "與"     "管制"  
[53] "。"  

The answer to this mystery was found in stringi::stri_split_boundaries, the underlying function of quanteda's tokenizer. stri_split_boundaries utilizes a library called ICU (International Components for Unicode), which uses dictionaries to segment texts in Chinese, Japanese, Thai or Khmer. The Japanese dictionary is actually an IPA dictionary, on which MeCab also depends.
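
If ICU is indeed doing the work, the same segmentation should be reproducible with stringi alone, bypassing quanteda entirely. Here is a minimal sketch (type and skip_word_none are ICU break-iterator options accepted by stri_split_boundaries; output omitted):

> # Segment with ICU word boundaries directly; no POS tagger involved
> library(stringi)
> stri_split_boundaries(txt_jp, type = "word")
> # Optionally drop whitespace and punctuation from the result
> stri_split_boundaries(txt_cn, type = "word", skip_word_none = TRUE)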

This means that those who perform bag-of-words text analysis of Chinese, Japanese, Thai or Khmer texts no longer need to install a POS tagger for word segmentation. This could be a massive boost for social scientific text analysis in those languages!

Word segmentation of Japanese and Chinese texts by stringi


Many people believe that morphological analysis with a tool such as MeCab or ChaSen is indispensable for word segmentation of Japanese texts, but this does not always seem to be the case. I learned this while examining quanteda's tokenization function: when I passed a Japanese text to it, the text was split into words as cleanly as by MeCab.

> txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"
> quanteda::tokens(txt_jp)
tokens from 1 document.
Component 1 :
 [1] "政治"         "と"           "は"           "社会"         "に対して"     "全体"         "的"           "な"          
 [9] "影響"         "を"           "及"           "ぼ"           "し"           "、"           "社会"         "で"          
[17] "生きる"       "ひとりひとり" "の"           "人"           "の"           "人生"         "に"           "も"          
[25] "様々"         "な"           "影響"         "を"           "及ぼす"       "複雑"         "な"           "領域"        
[33] "で"           "ある"         "。" 

quanteda has no morphological analyzer, so it was a surprise that its tokenization function also segmented a Chinese text cleanly.

> txt_cn <- "政治是各种团體进行集体决策的一个过程,也是各种团體或个人为了各自的領域所结成的特定关系,尤指對於某一政治實體的統治,例如統治一個國家,亦指對於一國內外事務之監督與管制。"
> quanteda::tokens(txt_cn)
tokens from 1 document.
Component 1 :
 [1] "政治"   "是"     "各种"   "团"     "體"     "进行"   "集体"   "决策"   "的"     "一个"   "过程"   ","     "也是"  
[14] "各种"   "团"     "體"     "或"     "个人"   "为了"   "各自"   "的"     "領域"   "所"     "结成"   "的"     "特定"  
[27] "关系"   ","     "尤"     "指"     "對於"   "某一"   "政治"   "實體"   "的"     "統治"   ","     "例如"   "統治"  
[40] "一個"   "國家"   ","     "亦"     "指"     "對於"   "一"     "國內外" "事務"   "之"     "監督"   "與"     "管制"  
[53] "。"  

Looking into this further, I found that this mysterious behavior occurs because stringi::stri_split_boundaries, on which the tokenization function is based, utilizes ICU (International Components for Unicode), which performs dictionary-based segmentation for Chinese, Japanese, Thai and Khmer, and because its Japanese dictionary is the IPA dictionary that MeCab also uses.

If word segmentation can be done with stringi, there is no need to use morphological analysis for Japanese or Chinese text analysis that does not involve syntactic parsing. This lowers the technical barrier, so social scientific analysis of Japanese and Chinese texts should spread from here.

Visualizing media representation of the world


I uploaded an image visualizing foreign news coverage early this year, but I found that the image is very difficult to interpret, because both large positive and large negative values are important in SVD. Large positive values can be the result of intense media attention, but what do large negative values mean?

A solution to this problem is the use of non-negative matrix factorization (NMF). As the name of the algorithm suggests, the matrices produced by NMF are restricted to be non-negative, so they are much easier to interpret. I used this technique to visualize the media representation of the world in the Guardian, the BBC and the Irish Times in 2012–2016. I first created a large matrix from co-occurrences of countries in news stories, and then reduced the dimensionality of the secondary countries (the columns) to five.
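
For illustration, here is a minimal sketch of this kind of decomposition using the NMF package; cooc is a hypothetical stand-in for the country co-occurrence matrix, which is not reproduced here:

> # Toy stand-in for the country-by-country co-occurrence matrix (hypothetical;
> # the real one counts co-occurrences of countries in news stories)
> library(NMF)
> set.seed(42)
> cooc <- matrix(rpois(200 * 20, lambda = 3), nrow = 200,
+                dimnames = list(paste0("country", 1:200), paste0("col", 1:20)))
> res <- nmf(cooc, rank = 5)  # reduce the columns to five non-negative factors
> w <- basis(res)             # primary countries (rows) by the five components
> heatmap(w, Rowv = NA, Colv = NA, scale = "none")  # dark cells = prominence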

Rows of the matrices are sorted by raw frequency counts. Australia (AU), China (CN), Japan (JP) and India (IN) are highlighted, because in this analysis I was interested in how the UK was represented by the media in relation to non-EU countries.

[Figure: brexit_heatmap_all — NMF heatmaps for the Guardian, the BBC and the Irish Times]

I can identify columns that can be labeled 'home-centric', 'European', 'Middle East' and 'Asia-Pacific' clusters based on their prominent countries. In the Guardian, the home-centric cluster is G1, because Britain is the single most important country, as the very dark color of the cell shows. The European cluster is G3, because Germany, Greece, Spain, Italy and Ireland have dark cells in that column. The Middle East cluster is G2, in which we find dark cells for Syria (SY), Iran (IR), Afghanistan (AF) and Israel (IL). The Asia-Pacific cluster is G5, where China (CN), Australia (AU), Japan (JP), India (IN), the United States (US) and Canada (CA) are prominent. In the BBC, the home-centric cluster is G1, and the European cluster is G4, where France, Greece, Spain and Germany are found, although the United States also appears there. The Middle East cluster is G2, which includes Syria, Iraq, Egypt (EG) and Israel (IL). The Asia-Pacific cluster is G3, where China, the United States, Australia, India and Japan are found. In the Irish Times, the home-centric cluster is G1, the European cluster is G4, the Middle East cluster is G3, and the Asia-Pacific cluster is G5.

More detail is available in 'Britain in the world: Visual analysis of international relations in the British and the Irish media before the referendum', presented at the 2016 Political Studies Association Conference.