Obstacles to Asian-language text analysis


In a presentation titled Internationalizing Text Analysis at a workshop at Waseda University on 27 June, Oul Han and I discussed what is obstructing the adoption of quantitative text analysis techniques in Japan and Korea. Our question was why so few people do quantitative analysis of Japanese and Korean texts, even though it has become one of the mainstream methodologies in North America and Europe. To explain this, we identified four key areas: tools, data, skills, and literature.

Tools

We have seen exciting developments in text analysis tools in recent years. Unicode support has improved dramatically thanks to the stringi package, and we have released quanteda, which enables analysis of Asian-language texts in the same way as English. Morphological analysis tools have been available in R for some time (RMeCab and RMecabKo), and RcppMeCab, which supports both Japanese and Korean, has been released recently. In terms of available tools, there is no reason not to embark on quantitative analysis of Asian texts.
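
As a minimal sketch of what this looks like in practice, quanteda can tokenize Japanese text out of the box via ICU word segmentation (the sentence below is only an illustrative example of my own):

require(quanteda)
txt_ja <- "これは日本語のテキスト分析の例です。" # "This is an example of Japanese text analysis."
toks_ja <- tokens(txt_ja, remove_punct = TRUE) # words are segmented automatically
dfm(toks_ja) # construct a document-feature matrix just as you would for English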

Data

Official political documents are publicly available in both Japan and Korea, but unofficial political documents, such as election manifestos, are not. Further, media texts are generally more difficult to collect because of copyright protection: while Korean newspaper articles are available in KINDS and the Dow Jones Factiva database, Japanese newspaper articles are only available in the publishers’ commercial databases. Improving access to textual data takes time, but compiling exhaustive lists of Japanese and Korean sources would be a good first step.

Skills

You need different skills at different stages of a text analysis project. Designing social scientific research using quantitative text analysis requires broad knowledge of the techniques and their applications. Data collection often involves access to APIs or the use of scrapers, which demands knowledge of machine-readable formats (HTML, XML, JSON) and computer programming. Quantitative text analysis is not always statistical, but you still need to know descriptive and inferential statistics (e.g. chi-square tests, t-tests, regression analysis). These skills can be acquired through lectures and seminars, but few if any text analysis courses are offered at Japanese and Korean universities. Until such courses become widely available, we need to organize workshops to train future text analysts.
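
As a rough illustration of the data-collection skills mentioned above, retrieving and parsing a JSON response from an API in R could look like the sketch below (the URL is a hypothetical placeholder, not a real endpoint):

require(jsonlite)
url <- "https://api.example.org/articles?query=election" # hypothetical API endpoint
dat <- fromJSON(url) # download and parse the JSON response into R objects
str(dat, max.level = 1) # inspect the structure of the result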

Literature

The lack of a standard textbook on social scientific text analysis has been one of the biggest problems, limiting the opportunity to acquire the above-mentioned skills to people based in North America or Europe. Aiming to address this problem, I created an online textbook with Stefan Müller, but its pages are all in English. I recently added a section explaining language-specific pre-processing, but it has only one page for Japanese. We should translate the online textbook into other languages and add more pages on how to handle Asian-language texts.

If you want to know more, please see the slides.

Analyze big data with small RAM


A lot of people use quanteda to analyze social media posts because it is very fast and flexible, but they sometimes face a dramatic slowdown due to memory swapping caused by insufficient RAM. quanteda requires RAM roughly five times the size of the data to analyze, and this can rise to ten times when the data comprise many short documents. For example, in the first block of code, the original texts (txt) occupy only 286 MB in memory, but the tokens object (toks) is 987.8 MB and the document-feature matrix (mt) is 1411.8 MB (measured by object.size()):

require(quanteda)
txt <- readLines('tweets.txt') # 286 MB
length(txt) # 3000000

toks <- tokens(txt) # 987.8 MB
mt <- dfm(toks) # 1411.8 MB
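
The sizes quoted above can be checked with object.size(); a minimal illustration using the objects created in the block above:

print(object.size(txt), units = "MB")  # roughly 286 MB
print(object.size(toks), units = "MB") # roughly 988 MB
print(object.size(mt), units = "MB")   # roughly 1412 MB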

I recommend that users install as much RAM as possible in their machines, but there is a way to analyze big data with small RAM. You cannot avoid the large document-feature matrix, because that is what you need for the analysis, but you can skip keeping the tokens object, since it is only an intermediate product here.

In the second block of code, I split the data (txt) into chunks of 10,000 documents and pass them to dfm() chunk by chunk to avoid creating a large tokens object. dfm() still creates a tokens object internally, but its size is only around 5 MB for that number of documents. The output of dfm() is then appended to mt using rbind(), so that all the documents end up in mt. gc() asks R’s garbage collector to delete unused objects and release memory.

index <- seq(length(txt))
batch <- split(index, ceiling(index / 10000)) # each batch has 10000 indices

for (i in batch) {
    cat("Constructing dfm from", i[1], "\n")
    if (i[1] > 1) {
        mt <- rbind(mt, dfm(txt[i]))
    } else {
        mt <- dfm(txt[i])
    }
    gc()
}
ndoc(mt) # 3000000

You might wonder why the chunk size is 10,000 and not 20,000 or 50,000. The reason is that quanteda tokenizes texts in batches of 10,000 documents, so 10,000 is the largest chunk size that does not trigger another internal batch.

Relaxing R version requirement


Until quanteda v1.1, our users needed to have R 3.4.0 installed, but we have relaxed the requirement to R 3.1.0, because people working in companies or other large organizations often do not have the latest version of R on their computers and therefore could not use our package. To quickly investigate why quanteda required R 3.4.0, I wrote a function called print_depends():

require(stringi)

# Recursively print the R version requirement declared in each package's
# Depends field, following its Depends and Imports down the dependency tree.
print_depends <- function(package, level = 0) {
    desc <- packageDescription(package)
    if ("Depends" %in% names(desc)) {
        dep <- stri_trim_both(desc$Depends)
        # extract the "R (>= x.y.z)" requirement, if any
        r <- stri_extract_first_regex(dep, 'R \\(.*?\\)')
        if (is.na(r)) r <- ''
        cat(strrep('  ', level), package, ": ", r, "\n", sep = '')
        if ("Imports" %in% names(desc)) {
            # combine Depends and Imports, strip the R requirement and
            # version constraints, and recurse into each dependency
            imp <- paste0(desc$Depends, ', ', desc$Imports)
            if (r != '')
                imp <- stri_replace_first_fixed(imp, r, '')
            imp <- stri_replace_all_regex(imp, ' \\(.*?\\)', '')
            imp <- unlist(stri_split_regex(imp, ','))
            imp <- stri_trim_both(imp)
            imp <- imp[imp != '']
            for (i in imp) {
                print_depends(i, level + 1)
            }
        }
    }
}
print_depends('quanteda')

The output showed that quanteda needed R 3.4.0 only because slam requires it, and our package depended on slam only because of wordcloud.

quanteda: R (>= 3.4.0)
  extrafont: R (>= 2.15)
    extrafontdb: R (>= 2.14)
    Rttf2pt1: R (>= 2.15)
  digest: R (>= 2.4.1)
  Matrix: R (>= 3.0.1)
    lattice: R (>= 3.0.0)
  wordcloud: 
    RColorBrewer: R (>= 2.0.0)
    slam: R (>= 3.4.0)
    Rcpp: R (>= 3.0.0)
  sna: R (>= 2.0.0)
  network: R (>= 2.10)
  ggrepel: R (>= 3.0.0)
    ggplot2: R (>= 3.1)
      digest: R (>= 2.4.1)
      gtable: R (>= 2.14)
      MASS: R (>= 3.1.0)
      plyr: R (>= 3.1.0)
        Rcpp: R (>= 3.0.0)
      reshape2: R (>= 3.1)
        plyr: R (>= 3.1.0)
          Rcpp: R (>= 3.0.0)
        Rcpp: R (>= 3.0.0)
        stringr: R (>= 3.1)
          glue: R (>= 3.1)
          stringi: R (>= 2.14)
      scales: R (>= 2.13)
        RColorBrewer: R (>= 2.0.0)
        dichromat: R (>= 2.10)
        plyr: R (>= 3.1.0)
          Rcpp: R (>= 3.0.0)
        Rcpp: R (>= 3.0.0)
        R6: R (>= 3.0)
        viridisLite: R (>= 2.10)
      tibble: R (>= 3.1.0)
        cli: R (>= 2.10)
        rlang: R (>= 3.1.0)
      lazyeval: R (>= 3.1.0)
    Rcpp: R (>= 3.0.0)
    scales: R (>= 2.13)
      RColorBrewer: R (>= 2.0.0)
      dichromat: R (>= 2.10)
      plyr: R (>= 3.1.0)
        Rcpp: R (>= 3.0.0)
      Rcpp: R (>= 3.0.0)
      R6: R (>= 3.0)
      viridisLite: R (>= 2.10)
  RcppParallel: R (>= 3.0.2)
  RSpectra: R (>= 3.0.2)
    Matrix: R (>= 3.0.1)
      lattice: R (>= 3.0.0)
    Rcpp: R (>= 3.0.0)
  stringi: R (>= 2.14)
  ggplot2: R (>= 3.1)
    digest: R (>= 2.4.1)
    gtable: R (>= 2.14)
    MASS: R (>= 3.1.0)
    plyr: R (>= 3.1.0)
      Rcpp: R (>= 3.0.0)
    reshape2: R (>= 3.1)
      plyr: R (>= 3.1.0)
        Rcpp: R (>= 3.0.0)
      Rcpp: R (>= 3.0.0)
      stringr: R (>= 3.1)
        glue: R (>= 3.1)
        stringi: R (>= 2.14)
    scales: R (>= 2.13)
      RColorBrewer: R (>= 2.0.0)
      dichromat: R (>= 2.10)
      plyr: R (>= 3.1.0)
        Rcpp: R (>= 3.0.0)
      Rcpp: R (>= 3.0.0)
      R6: R (>= 3.0)
      viridisLite: R (>= 2.10)
    tibble: R (>= 3.1.0)
      cli: R (>= 2.10)
      rlang: R (>= 3.1.0)
    lazyeval: R (>= 3.1.0)
  XML: R (>= 2.13.0)
  lubridate: R (>= 3.0.0)
    stringr: R (>= 3.1)
      glue: R (>= 3.1)
      stringi: R (>= 2.14)
    Rcpp: R (>= 3.0.0)
  spacyr: R (>= 3.0.0)
    data.table: R (>= 3.0.0)
    reticulate: R (>= 3.0)
      jsonlite: 
      Rcpp: R (>= 3.0.0)
  stopwords: R (>= 2.10)
    ISOcodes: R (>= 2.10.0)

Since we never use tm for text processing, we decided to write our own word cloud function, incorporating the publicly available code, and dropped wordcloud (and with it slam) from our dependencies. Since this change in v1.1.0, quanteda and the packages that depend on it (e.g. tidytext) can be installed with R versions as old as 3.1.0, which was released over three years ago. I hope this change makes quantitative text analysis more accessible to R users, especially those in industry.

PhD thesis is now archived


My PhD thesis, titled Measuring bias in international news: a large-scale analysis of news agency coverage of the Ukraine crisis, has been archived electronically in the LSE Library and is now publicly available.

This thesis is a compilation of research papers, three of which have already been published, but its overall conclusion is more than a summary of the findings in those papers. I argue that, although many still strongly believe that western and non-western media report key geopolitical events very differently, in line with their home countries’ strategic interests, the difference between western and non-western media is becoming less clear, and media outlets’ alignment with their governments is weaker, owing to their global operations and non-western governments’ propaganda efforts. This trend was clearest when Reuters replicated and spread the Russian government’s narratives about Ukraine during the crisis in 2014.