Obstacles to Asian-language text analysis


In a presentation titled Internationalizing Text Analysis at a workshop on 27 June at Waseda University, Oul Han and I discussed what is obstructing the adoption of quantitative text analysis techniques in Japan and Korea. Our question was why so few people do quantitative analysis of Japanese and Korean texts, even though it is becoming one of the mainstream methodologies in North America and Europe. To explain this, we identified four key areas: tools, data, skills, and literature.


We have seen exciting developments in text analysis tools in recent years. Unicode support has improved dramatically thanks to the stringi package. We have released quanteda, which enables analysis of Asian-language texts in the same way as English. Morphological analysis tools have existed in R for some time (RMeCab and RMecabKo), but RcppMeCab, which supports both Japanese and Korean, was released recently. In terms of available tools, there is no reason not to embark on quantitative analysis of Asian texts.
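To illustrate why Unicode support matters for Japanese and Korean, here is a minimal sketch of text normalization. It uses Python's standard unicodedata module rather than stringi, and the function name normalize_text is a hypothetical helper chosen for this example, so treat it as an analogy to what the R tools do, not as their API.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC normalization (hypothetical helper for illustration).

    NFKC folds compatibility variants into a canonical form, e.g.
    half-width katakana into full-width katakana and full-width
    Latin letters and digits into their ASCII counterparts.
    """
    return unicodedata.normalize("NFKC", text)


# Half-width katakana and full-width alphanumerics are common in
# Japanese documents and would otherwise count as distinct tokens.
halfwidth = "ﾃｷｽﾄ"
fullwidth_latin = "ＡＢＣ１２３"

print(normalize_text(halfwidth))
print(normalize_text(fullwidth_latin))

# Korean syllables can be stored precomposed (NFC) or as separate
# jamo (NFD); the two forms look identical but compare as unequal.
composed = "한국어"
decomposed = unicodedata.normalize("NFD", composed)
print(len(composed), len(decomposed))
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Without a step like this, the same word can appear under several byte-level spellings and be counted as different features.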


Official political documents are publicly available in both Japan and Korea, but unofficial political documents such as election manifestos are not. Further, media texts are generally more difficult to collect because of copyright protection. While Korean newspaper articles are available in KINDS and the Dow Jones Factiva database, Japanese newspaper articles are only available in the publishers’ commercial databases. It will take time to improve access to textual data, but we should begin by compiling exhaustive lists of Japanese and Korean sources.


You need different skills at different stages of a text analysis project. Designing social scientific research using quantitative text analysis requires broad knowledge of the techniques and their applications. Data collection often involves access to APIs or the use of scrapers, which demand knowledge of machine-readable formats (HTML, XML, JSON) and computer programming. Quantitative text analysis is not always statistical, but you still need to know descriptive and inferential statistics (e.g. chi-square, t-test, regression analysis). These skills can be acquired through lectures and seminars, but very few or no text analysis courses are offered in Japanese and Korean universities. Until such courses become widely available, we need to organize workshops to train future text analysts.
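As a small illustration of how these skills combine, the sketch below parses JSON records of the kind an API might return and computes a chi-square statistic for one word across two groups of documents. The data, the field names (party, text), and the function names are all invented for this example, and the whitespace tokenizer is a simplifying assumption: Japanese or Korean text would first need a morphological analyzer to segment it.

```python
import json


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for a 2x2 table [[a, b], [c, d]], no correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: an empty row or column
    return n * (a * d - b * c) ** 2 / denom


def word_keyness(docs: list, word: str, target: str) -> float:
    """Chi-square of `word` in the target group versus all other groups."""
    a = b = c = d = 0
    for doc in docs:
        tokens = doc["text"].split()  # assumption: pre-segmented text
        hits = tokens.count(word)
        rest = len(tokens) - hits
        if doc["party"] == target:
            a += hits
            b += rest
        else:
            c += hits
            d += rest
    return chi_square_2x2(a, b, c, d)


# Toy records standing in for an API response (invented data).
raw = """
[
  {"party": "A", "text": "tax tax reform"},
  {"party": "B", "text": "welfare reform welfare"}
]
"""
docs = json.loads(raw)
print(word_keyness(docs, "tax", "A"))
```

A larger statistic suggests the word is distributed unevenly between the groups, but with counts this small the value is only illustrative and should not be read as a significance test.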


The lack of a standard textbook on social scientific text analysis has been one of the biggest problems, limiting the opportunity to acquire the above-mentioned skills to people based in North America or Europe. To address this problem, I created an online textbook with Stefan Müller, but its pages are all in English. I recently added a section explaining language-specific pre-processing, but there is only one page for Japanese. We should translate the online textbook into other languages and add more pages on how to handle Asian-language texts.

If you want to know more, please see the slides.