Segmentation of Japanese or Chinese texts by stringi


Use of a POS tagger such as MeCab or ChaSen has been considered necessary for segmentation of Japanese texts, because words are not separated by spaces as they are in European languages, but I recently learned that this is not always the case. When I was testing quanteda's tokenization function, I passed a Japanese text to it without much expectation, but the result was very interesting: the Japanese words were segmented very nicely! The output was as accurate as one from MeCab.

> txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"
> quanteda::tokenize(txt_jp)
tokenizedTexts from 1 document.
Component 1 :
 [1] "政治"         "と"           "は"           "社会"         "に対して"     "全体"         "的"          
 [8] "な"           "影響"         "を"           "及"           "ぼ"           "し"           "、"          
[15] "社会"         "で"           "生きる"       "ひとりひとり" "の"           "人"           "の"          
[22] "人生"         "に"           "も"           "様々"         "な"           "影響"         "を"          
[29] "及ぼす"       "複雑"         "な"           "領域"         "で"           "ある"         "。"

Also, quanteda's tokenizer segmented Chinese texts very well, which it should not be able to do, because there is no POS tagger in the package.

> txt_cn <- "政治是各种团體进行集体决策的一个过程,也是各种团體或个人为了各自的領域所结成的特定关系,尤指對於某一政治實體的統治,例如統治一個國家,亦指對於一國內外事務之監督與管制。"
> quanteda::tokenize(txt_cn)
tokenizedTexts from 1 document.
Component 1 :
 [1] "政治"   "是"     "各种"   "团"     "體"     "进行"   "集体"   "决策"   "的"     "一个"   "过程"  
[12] ","     "也是"   "各种"   "团"     "體"     "或"     "个人"   "为了"   "各自"   "的"     "領域"  
[23] "所"     "结成"   "的"     "特定"   "关系"   ","     "尤"     "指"     "對於"   "某一"   "政治"  
[34] "實體"   "的"     "統治"   ","     "例如"   "統治"   "一個"   "國家"   ","     "亦"     "指"    
[45] "對於"   "一"     "國內外" "事務"   "之"     "監督"   "與"     "管制"   "。"  

The answer to this mystery was found in stringi::stri_split_boundaries, the underlying function of quanteda's tokenizer. stri_split_boundaries relies on a library called ICU (International Components for Unicode), which uses dictionaries for segmentation of texts in Chinese, Japanese, Thai and Khmer. The Japanese dictionary is in fact an IPA dictionary, on which MeCab also depends.
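The ICU word boundaries can also be used directly through stringi, bypassing quanteda. A minimal sketch; `skip_word_none = TRUE` is a break-iterator option that drops whitespace- and punctuation-only tokens:

```r
# Dictionary-based word segmentation via ICU's break iterator,
# without any external POS tagger
library(stringi)

txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"
stri_split_boundaries(txt_jp, type = "word", skip_word_none = TRUE)
```

The result is a list with one character vector per input string, very similar to the quanteda output above.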

This means that people who perform bag-of-words text analysis of Chinese, Japanese, Thai or Khmer texts no longer need to install a POS tagger for word segmentation. This could be a massive boost for social scientific text analysis in those languages!

Word segmentation of Japanese and Chinese texts by stringi


Many people consider morphological analysis with tools such as MeCab or ChaSen indispensable for segmenting Japanese texts, but this is not necessarily the case. I learned this while investigating quanteda's tokenization function: when I passed a Japanese text to the function, it was segmented into words as cleanly as MeCab would do.

> txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"
> quanteda::tokenize(txt_jp)
tokenizedTexts from 1 document.
Component 1 :
 [1] "政治"         "と"           "は"           "社会"         "に対して"     "全体"         "的"          
 [8] "な"           "影響"         "を"           "及"           "ぼ"           "し"           "、"          
[15] "社会"         "で"           "生きる"       "ひとりひとり" "の"           "人"           "の"          
[22] "人生"         "に"           "も"           "様々"         "な"           "影響"         "を"          
[29] "及ぼす"       "複雑"         "な"           "領域"         "で"           "ある"         "。"

quanteda has no morphological analyzer, so it was surprising that its tokenization function also segmented Chinese texts cleanly.

> txt_cn <- "政治是各种团體进行集体决策的一个过程,也是各种团體或个人为了各自的領域所结成的特定关系,尤指對於某一政治實體的統治,例如統治一個國家,亦指對於一國內外事務之監督與管制。"
> quanteda::tokenize(txt_cn)
tokenizedTexts from 1 document.
Component 1 :
 [1] "政治"   "是"     "各种"   "团"     "體"     "进行"   "集体"   "决策"   "的"     "一个"   "过程"  
[12] ","     "也是"   "各种"   "团"     "體"     "或"     "个人"   "为了"   "各自"   "的"     "領域"  
[23] "所"     "结成"   "的"     "特定"   "关系"   ","     "尤"     "指"     "對於"   "某一"   "政治"  
[34] "實體"   "的"     "統治"   ","     "例如"   "統治"   "一個"   "國家"   ","     "亦"     "指"    
[45] "對於"   "一"     "國內外" "事務"   "之"     "監督"   "與"     "管制"   "。"  

Looking into this further, I found the reason for this mysterious behavior: stringi::stri_split_boundaries, on which the tokenization function is based, uses ICU (International Components for Unicode), which performs dictionary-based segmentation for Chinese, Japanese, Thai and Khmer, and its Japanese dictionary is the IPA dictionary that MeCab also uses.

If stringi can do word segmentation, then morphological analysis is unnecessary for text analysis of Japanese and Chinese that does not rely on syntactic parsing. This lowers the technical barrier, so social scientific analysis of Japanese and Chinese texts should spread from here.

Visualizing media representation of the world


I uploaded an image visualizing foreign news coverage early this year, but I found that the image is very difficult to interpret, because both large positive and large negative values are important in SVD. Large positive values can be the result of intense media attention, but what do large negative values mean?

A solution to this problem is non-negative matrix factorization (NMF). As the name of the algorithm suggests, the matrices produced by NMF are restricted to be non-negative, and are therefore much easier to interpret. I used this technique to visualize media representation of the world in the Guardian, the BBC and The Irish Times in 2012-2016. I first created a large matrix from co-occurrences of countries in news stories, and then reduced the dimension of the secondary countries (columns) to five.

Rows of the matrices are sorted in order of raw frequency counts. Australia (AU), China (CN), Japan (JP) and India (IN) are highlighted, because I was interested in how the UK was represented by the media in relation to non-EU countries in this analysis.
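The decomposition step can be sketched as follows. This is a minimal illustration with a random stand-in matrix, not the original data, and it assumes the CRAN `NMF` package:

```r
# Sketch: reduce the columns of a country co-occurrence matrix to
# five non-negative dimensions with the 'NMF' package
library(NMF)

set.seed(123)
mat <- matrix(runif(100 * 20), nrow = 100)  # toy primary x secondary country counts
res <- nmf(mat, rank = 5)                   # factorize into 5 components
w <- basis(res)                             # 100 x 5: country loadings per cluster
```

Because every entry of `w` is non-negative, a dark cell in a heatmap of `w` can be read directly as strong association with that cluster.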

brexit_heatmap_all

I can identify columns that can be labeled ‘home-centric’, ‘European’, ‘Middle East’ and ‘Asia-Pacific’ clusters based on the prominent countries. In the Guardian, the home-centric cluster is G1, because Britain is the single most important country, as the very dark color of the cell shows. The European cluster is G3, because Germany, Greece, Spain, Italy and Ireland have dark cells in the column. The Middle East cluster is G2, in which we find dark cells for Syria (SY), Iran (IR), Afghanistan (AF) and Israel (IL). The Asia-Pacific cluster is G5, where China (CN), Australia (AU), Japan (JP), India (IN), the United States (US) and Canada (CA) are prominent. In the BBC, the home-centric cluster is G1, and the European cluster is G4, where France, Greece, Spain and Germany are found, although the United States is also present. The Middle East cluster is G2, which includes Syria, Iraq, Egypt (EG) and Israel (IL). The Asia-Pacific cluster is G3, where China, the United States, Australia, India and Japan are found. In the Irish newspaper, the home-centric cluster is G1, the European cluster is G4, the Middle East cluster is G3, and the Asia-Pacific cluster is G5.

Visualizing foreign news coverage


The challenge in international news research is identifying patterns in foreign news reporting, which covers thousands of events in hundreds of countries, and visualization seems to be useful for this. This chart summarizes foreign news coverage by the New York Times between 2012 and 2015 with heatmaps, where rows and columns respectively represent the most frequent countries and the most significant events (or groups of events).

For example, bright cells in the rows labeled SY show the newspaper's coverage of the Syrian civil war. In 2012, the civil war was the fifth most important event (E5), and it became the second most important event (E2) in 2013, although the newspaper shifted its focus to the Ukraine crisis (E2) in 2014, making the Syrian civil war insignificant. In 2015, however, the Syrian civil war became the second most important event again. Further, in that year, Syria also scores high in E8, the European refugee crisis, along with Greece (GR), Iraq (IQ) and the UK (GB). Germany (DE) is also significant in this event, although the country does not appear in the chart because it is only 12th from the top.

heatmap5_nyt

This chart was created from news summaries geographically classified by Newsmap. For each year, a date-country matrix was created, and its date dimension was reduced by SVD to identify the 10 most important events. Scores for countries were centered and normalized for each event.
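The pipeline described above can be sketched in a few lines of base R. The matrix here is toy count data, not the original corpus:

```r
# Sketch: SVD on a date x country matrix, then centering and
# scaling country scores for each of the 10 components ('events')
set.seed(42)
m <- matrix(rpois(365 * 30, lambda = 2), nrow = 365)  # toy daily counts, 30 countries
s <- svd(m, nu = 0, nv = 10)   # keep the 10 largest right singular vectors
scores <- scale(s$v)           # 30 x 10: per-event country scores, centered and normalized
```

Each column of `scores` corresponds to one event, and the rows give the standardized involvement of each country in that event, which is what the heatmap cells show.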

Best paper award at ICA methodology pre-conference


I presented my paper on geographical classification at the methodology pre-conference at ICA in Fukuoka, Japan. The pre-conference has historical significance as the first methodology group at a major international conference in media and communication studies. There were a lot of interesting presentations but, to my great surprise, I won a Best Paper Award from the organizers. According to the judges, I received the award because I had done the most rigorous validation of a new methodology.

Newsmap in R


I have been using Newsmap in many of my research projects as one of my key tools, but I was not able to share the tool with other people because it was a complex Python system. To make the tool widely available, I recently implemented Newsmap in R. The R version depends on another text analysis package, quanteda, to which I am also contributing.

Newsmap is back


International Newsmap had been offline for a while due to a restriction imposed by my server hosting company: the number of news stories in Newsmap had been increasing since 2011, and the company decided to disable the database. I left it offline, lacking the motivation to restore the system, but an email from a student in an international journalism course in Portugal encouraged me to do the job.

I moved the website to a new, more powerful server, and upgraded its user interface as well as the system behind it. The media sources have been reduced to 11 newspapers and satellite news channels that I find interesting. The geographical classifier has been updated to the latest version, and its training is now based solely on stories collected from Yahoo News.

In the new user interface, there is no button to switch between daily, monthly and yearly views, but if you click the + signs next to the dates on the right-hand side, you can see monthly or daily figures.

Analysis of Russian media


Applying techniques developed on English-language texts to other languages is not easy, but I managed to adapt my LSS system to the Russian language for a project on Russian media framing of street protests. In the project, I am responsible for collecting and analyzing Russian-language news from state-controlled media between 2011 and 2014. The dictionary that I created measures a dimension of protest framing, freedom of expression vs. social disorder, as well as human coders do. The details of the dictionary construction procedure are available in one of my posts. I will keep posting to the project blog.

Countries with state-owned news agencies


It is little recognized, even among students of mass media, that the international news system is a network of national and regional news agencies, and that many of them are state-owned. Fully commercial agencies like Reuters are very rare, and even international news agencies, such as AFP, are often subsidized by governments. To get a broad picture of state ownership of news agencies, I collected information from the BBC's media profiles and identified the countries that have state-run news agencies. It turned out that, among the 114 countries in the source, 40.3% have state-run agencies.

In this plot, red-colored countries have state-run news agencies, and we notice that they tend to have neither very large nor very small economies as measured by GDP: large domestic media markets allow news agencies to stay independent through commercial operation, while small economies simply cannot support national news agencies.

More important is the concentration of state-owned agencies in countries with limited press freedom: a press freedom score below 40 points is considered ‘not free’ by Freedom House. This means that news reports from state-run news agencies in unfree countries may be biased in favor of their governments, and those stories enter the international news distribution system. Are the foreign news stories we read free from such biases?

state-run_agencies

ITAR-TASS’s coverage of annexation of Crimea


My main research interest is the estimation of media bias using text analysis techniques. I did a very crude analysis of ITAR-TASS's coverage of the Ukraine crisis two years ago, but it is time to redo everything with more sophisticated tools. I created positive-negative dictionaries for democracy and sovereignty, and applied them to see how the Russian news agency covered events related to the annexation of Crimea.

In both charts, TASS's (red) coverage shifts positively during the period between the announcement of the annexation (K1) and the referendum (K3). The change is visible not only in absolute terms, but also relative to Interfax's (blue).

cri_sov
cri_dem

The positive shift is due to TASS's positive coverage of two key events. Taking the mean score over the ±3 days around K2, when the question of the referendum was changed from independence from Ukraine to annexation to Russia, TASS's stories on Crimean sovereignty appear very positive (11.7 points higher than Interfax; p < 0.01). The second high point is, of course, the day of the referendum (K3), when more than 95% of Crimeans allegedly voted for annexation. Over that seven-day period, the state of democracy in Crimea becomes very good in TASS's news stories (4.09 points higher than Interfax; p < 0.02). Why can I compare TASS with Interfax? Because their framing of Ukraine excluding Crimea (bold) is more or less the same during the same period, a difference found only in Crimea must be due to the difference in their status (TASS is state-owned, while Interfax is commercial) and the Kremlin's interest in Crimea.
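The windowed comparison can be sketched as below. This is a toy illustration with simulated scores, assuming a two-sample t-test was used for the significance levels reported above:

```r
# Sketch: compare mean dictionary scores of TASS and Interfax stories
# within a +-3 day window around an event (toy data, not the original)
set.seed(1)
tass     <- rnorm(50, mean = 12, sd = 10)  # hypothetical story scores in the window
interfax <- rnorm(50, mean = 0,  sd = 10)
res <- t.test(tass, interfax)              # Welch two-sample t-test
res$estimate                               # group means
res$p.value                                # significance of the difference
```

The same comparison run on Ukraine-excluding-Crimea stories would serve as the placebo check described above: if the two agencies differ only on Crimea, the difference can be attributed to TASS's state ownership.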