Stringiによる日本語と中国語のテキストの分かち書き

MecabやChasenなどのによる形態素解析が、日本語のテキストの分かち書きには不可欠だと多くの人が考えていますが、必ずしもそうではないようです。このことを知ったのは、quantedaのトークン化の関数を調べている時で、日本語のテキストをこの関数に渡してみると、単語が Mecabと同じように、きれいに単語に分かれたからです。 > txt_jp quanteda::tokens(txt_jp) tokens from 1 document. Component 1 : [1] “政治” “と” “は” “社会” “に対して” “全体” “的” “な” [9] “影響” “を” “及” “ぼ” “し” “、” “社会” “で” [17] “生きる” “ひとりひとり” “の” “人” “の” “人生” “に” “も” [25] “様々” “な” “影響” “を” “及ぼす” “複雑” “な” “領域” [33] “で” “ある” “。” quantedaには、形態素解析の機能がないのですが、そのトークン化関数は、中国語のテキストもきれいに、分かち書きをしたのは意外でした。 > txt_cn […]

Visualizing foreign news coverage

The challenge in international news research is identifying patterns in foreign news reporting, which cover thousands of events in hundreds of countries, but visualization seems to be useful. This chat summarizes foreign news coverage by the New York Times between 2012 and 2014 with heatmaps, where rows and columns respectively representing the most frequent countries […]

Newsmap in R

I have been using Newsmap in many of my research projects as one of the key tools, but I was not able share the tool with other people as it was a complex Python system. To make the tool widely available, I recently implemented Newsmap in R. The R version is dependent on another text […]

Analysis of Russian media

Application of the techniques developed with English language texts to other languages is not so easy, but I managed to adapt my LSS system to Russia language for a project on Russian media framing of street protests. In the project, I am responsible for data collection and analysis of Russian language news collected from state-controlled […]

Begin typing your search term above and press enter to search. Press ESC to cancel.

Back To Top