Obstruction to Asian-language text analysis

In a presentation titled Internationalizing Text Analysis at a workshop on 27 June at Waseda University, Oul Han and I discussed what is obstructing the adoption of quantitative text analysis techniques in Japan and Korea. Our question was why so few people do quantitative analysis of Japanese and Korean texts, even though it is becoming one of the mainstream methodologies in North America and Europe. To explain this, we identified four key fields: tools, data, skills, and literature.


We have seen exciting developments in text analysis tools in recent years. Unicode support has improved dramatically thanks to the stringi package. We have released quanteda, which enables analysis of Asian-language texts in the same way as English. There have long been morphological analysis tools in R (RMeCab and RMecabKo), and RcppMeCab, which supports both Japanese and Korean, has been released recently. In terms of available tools, there is no reason not to embark on quantitative analysis of Asian-language texts.
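The tools named above are R packages. As a rough, language-neutral illustration of what Unicode support makes possible, here is a minimal Python sketch using only the standard library. It assumes that NFKC normalization is an acceptable cleaning step for the corpus, and it uses character bigrams as a crude, dictionary-free stand-in for tokenization. It is not what stringi, quanteda or RcppMeCab do internally, and the function names are hypothetical.

```python
import unicodedata
from collections import Counter


def normalize_text(text):
    """Fold width variants (full-width Latin, half-width katakana) with NFKC."""
    return unicodedata.normalize("NFKC", text)


def char_bigrams(text):
    """Return overlapping two-character sequences within each whitespace-delimited chunk."""
    bigrams = []
    for chunk in text.split():
        bigrams.extend(chunk[i:i + 2] for i in range(len(chunk) - 1))
    return bigrams


# Illustrative strings, not taken from any real corpus.
japanese = normalize_text("ＡＩ政策とﾃﾞｰﾀ活用")
korean = normalize_text("경제 정책")

print(japanese)
print(Counter(char_bigrams(japanese)).most_common(3))
print(char_bigrams(korean))
```

Character bigrams ignore word boundaries, so for real research a morphological analyzer such as the MeCab-based tools mentioned above would normally be the better choice.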


Official political documents are publicly available in both Japan and Korea, but unofficial political documents such as election manifestos are not. Further, media texts are generally more difficult to collect because of copyright protection. While Korean newspaper articles are available in KINDS and the Dow Jones Factiva database, Japanese newspaper articles are only available in the publishers’ commercial databases. Improving access to textual data will take time, but we should begin by compiling exhaustive lists of Japanese and Korean sources.


You need different skills at different stages of a text analysis project. Designing social scientific research using quantitative text analysis requires broad knowledge of the techniques and their applications. Data collection often involves accessing APIs or using scrapers, which demands knowledge of machine-readable formats (HTML, XML, JSON) and computer programming. Quantitative text analysis is not always statistical, but you still need to know descriptive and inferential statistics (e.g. chi-square, t-test, regression analysis). These skills can be acquired through lectures and seminars, but very few or no text analysis courses are offered in Japanese and Korean universities. Until such courses become widely available, we need to organize workshops to train future text analysts.
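To make the data-handling and statistics steps concrete, here is a minimal Python sketch using only the standard library. It parses a JSON payload of the kind an API might return and computes a chi-square statistic for how strongly one word is associated with one group of documents. The field names (`party`, `text`), the sample records and the function names are all hypothetical, and the texts are assumed to be already tokenized and separated by spaces, which raw Japanese text is not.

```python
import json
from collections import Counter

# Hypothetical API response; the records and field names are made up.
RAW = """
[
  {"party": "A", "text": "税 改革 税 経済"},
  {"party": "A", "text": "税 福祉"},
  {"party": "B", "text": "外交 安全 経済"},
  {"party": "B", "text": "税 外交 安全"}
]
"""


def count_tokens(records, party):
    """Count tokens in all documents belonging to one party."""
    counts = Counter()
    for record in records:
        if record["party"] == party:
            counts.update(record["text"].split())
    return counts


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a degenerate table carries no information
    return n * (a * d - b * c) ** 2 / denominator


def keyness(records, word, target, reference):
    """Chi-square for `word` in the target party versus the reference party."""
    target_counts = count_tokens(records, target)
    reference_counts = count_tokens(records, reference)
    a = target_counts[word]                  # word in target documents
    b = reference_counts[word]               # word in reference documents
    c = sum(target_counts.values()) - a      # other tokens in target documents
    d = sum(reference_counts.values()) - b   # other tokens in reference documents
    return chi_square_2x2(a, b, c, d)


records = json.loads(RAW)
print(keyness(records, "税", target="A", reference="B"))
```

With four short documents the statistic is far too small to mean anything; the point is only the shape of the workflow: parse, count, then test.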


The lack of a standard textbook on social scientific text analysis has been one of the biggest problems, limiting the opportunity to acquire the above-mentioned skills to people based in North America or Europe. To address this problem, I created an online textbook with Stefan Müller, but its pages are all in English. I recently added a section that explains language-specific pre-processing, but it has only one page for Japanese. We should translate the online textbook into other languages and add more pages on how to handle Asian-language texts.

If you want to know more, please see the slides.


4 thoughts on “Obstruction to Asian-language text analysis”

  1. I searched for “文本分析 (text analysis)” in CNKI (China National Knowledge Infrastructure, 中国知网) and found that most of the results classified in the field of politics are simple analyses of the Chinese government’s policy papers. I think the situation in China is even worse. Few Chinese political scientists are currently proficient in statistics and programming. Moreover, Chinese academic journals censor articles on Chinese politics very strictly, which makes publishing such papers very difficult. However, I know some young Chinese scholars who are working hard on the data science approach to politics, and some Chinese journals are very open-minded about new methods in political science. Next month, I will write an article introducing text analysis methods for a WeChat subscription account that is followed by many Chinese students and scholars majoring in political science.

    1. I really hope that there will be many more Asian scholars who analyze texts in their languages. Quantitative analysis is usually less ideological, so it can be a better tool in countries with limited political or academic freedom.

  2. Hi Kohei, I am Rong. I have seen your online textbook on quanteda. It is great: very clear, and it helped me use quanteda in my own research. It is a very helpful tool. I also noticed that you want to add sections on different languages, which would be very useful for many Chinese users. If possible, I’d like to volunteer to translate your textbook into Chinese and to write the Chinese-language demo. Email me if I can help.
