Analyze big data with small RAM

Standard

A lot of people use quanteda to analyze social media posts because it is fast and flexible, but they sometimes face dramatic slowdowns due to memory swapping caused by insufficient RAM. quanteda requires roughly 5 times as much RAM as the size of the data to analyze, and as much as 10 times when the data is comprised of many short documents. For example, in the first block of code, the original texts (txt) occupy only 286 MB of memory, but the tokens object (toks) is 987.8 MB and the document-feature matrix (mt) is 1,411.8 MB (measured by object.size()):

require(quanteda)
txt <- readLines('tweets.txt') # 286 MB
length(txt) # 3000000

toks <- tokens(txt) # 987.8 MB
mt <- dfm(toks) # 1411.8 MB
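The sizes above can be checked in any session with object.size(); here is a minimal sketch, using a small stand-in character vector rather than the actual tweets:

```r
# Check how much memory an object occupies before deciding whether to chunk.
# `x` is a stand-in for the real character vector of tweets.
x <- rep("a short social media post", 1000)
print(object.size(x), units = "Kb")
```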

I recommend that users install as much RAM as possible in their machines, but there is also a way to analyze big data with small RAM. You cannot avoid creating the large document-feature matrix, because that is what you need for the analysis, but you can skip the tokens object, since it is only an intermediate product here.

In the second block of code, I split the data (txt) into chunks of 10,000 documents and pass them to dfm() chunk by chunk to avoid creating a large tokens object. dfm() still creates a tokens object internally, but its size is only around 5 MB for that number of documents. The output of dfm() is then appended to mt using rbind(), so that mt contains all the documents in the end. gc() asks R's garbage collector to delete unused objects to release memory.

index <- seq(length(txt))
batch <- split(index, ceiling(index / 10000)) # each batch has 10,000 indices

for (i in batch) {
    cat("Constructing dfm from", i[1], "\n")
    if (i[1] > 1) {
        mt <- rbind(mt, dfm(txt[i])) # append new documents to the dfm
    } else {
        mt <- dfm(txt[i]) # create the initial dfm
    }
    gc() # release memory used by intermediate objects
}
ndoc(mt) # 3000000

You might wonder why the size of a chunk is 10,000 and not 20,000 or 50,000. The reason is that quanteda performs tokenization iteratively in batches of 10,000 documents, so 10,000 is the largest chunk size that does not trigger another internal loop.

Relaxing R version requirement

Standard

Until quanteda v1.1, our users needed to have R 3.4.0 installed, but we have relaxed the requirement to R 3.1.0, because people working in companies or other large organizations often do not have the latest version of R on their computers and therefore could not use our package. To quickly investigate why quanteda required R 3.4.0, I wrote a function called print_depends():

require(stringi)

# Recursively print the R version requirement declared by a package
# and each of its dependencies
print_depends <- function(package, level = 0) {
    desc <- packageDescription(package)
    if ("Depends" %in% names(desc)) {
        dep <- stri_trim_both(desc$Depends)
        # extract the R version requirement, e.g. "R (>= 3.4.0)"
        r <- stri_extract_first_regex(dep, 'R \\(.*?\\)')
        if (is.na(r)) r <- ''
        cat(strrep('  ', level), package, ": ", r, "\n", sep = '')
        if ("Imports" %in% names(desc)) {
            # combine Depends and Imports, dropping the R requirement itself
            imp <- paste0(desc$Depends, ', ', desc$Imports)
            if (r != '')
                imp <- stri_replace_first_fixed(imp, r, '')
            imp <- stri_replace_all_regex(imp, ' \\(.*?\\)', '')
            imp <- unlist(stri_split_regex(imp, ','))
            imp <- stri_trim_both(imp)
            imp <- imp[imp != '']
            for (i in imp) {
                print_depends(i, level + 1)
            }
        }
    }
}
print_depends('quanteda')

The output showed that quanteda needs R 3.4.0 only because slam requires it; our package depends on slam only because of wordcloud.

quanteda: R (>= 3.4.0)
  extrafont: R (>= 2.15)
    extrafontdb: R (>= 2.14)
    Rttf2pt1: R (>= 2.15)
  digest: R (>= 2.4.1)
  Matrix: R (>= 3.0.1)
    lattice: R (>= 3.0.0)
  wordcloud: 
    RColorBrewer: R (>= 2.0.0)
    slam: R (>= 3.4.0)
    Rcpp: R (>= 3.0.0)
  sna: R (>= 2.0.0)
  network: R (>= 2.10)
  ggrepel: R (>= 3.0.0)
    ggplot2: R (>= 3.1)
      digest: R (>= 2.4.1)
      gtable: R (>= 2.14)
      MASS: R (>= 3.1.0)
      plyr: R (>= 3.1.0)
        Rcpp: R (>= 3.0.0)
      reshape2: R (>= 3.1)
        plyr: R (>= 3.1.0)
          Rcpp: R (>= 3.0.0)
        Rcpp: R (>= 3.0.0)
        stringr: R (>= 3.1)
          glue: R (>= 3.1)
          stringi: R (>= 2.14)
      scales: R (>= 2.13)
        RColorBrewer: R (>= 2.0.0)
        dichromat: R (>= 2.10)
        plyr: R (>= 3.1.0)
          Rcpp: R (>= 3.0.0)
        Rcpp: R (>= 3.0.0)
        R6: R (>= 3.0)
        viridisLite: R (>= 2.10)
      tibble: R (>= 3.1.0)
        cli: R (>= 2.10)
        rlang: R (>= 3.1.0)
      lazyeval: R (>= 3.1.0)
    Rcpp: R (>= 3.0.0)
    scales: R (>= 2.13)
      RColorBrewer: R (>= 2.0.0)
      dichromat: R (>= 2.10)
      plyr: R (>= 3.1.0)
        Rcpp: R (>= 3.0.0)
      Rcpp: R (>= 3.0.0)
      R6: R (>= 3.0)
      viridisLite: R (>= 2.10)
  RcppParallel: R (>= 3.0.2)
  RSpectra: R (>= 3.0.2)
    Matrix: R (>= 3.0.1)
      lattice: R (>= 3.0.0)
    Rcpp: R (>= 3.0.0)
  stringi: R (>= 2.14)
  ggplot2: R (>= 3.1)
    digest: R (>= 2.4.1)
    gtable: R (>= 2.14)
    MASS: R (>= 3.1.0)
    plyr: R (>= 3.1.0)
      Rcpp: R (>= 3.0.0)
    reshape2: R (>= 3.1)
      plyr: R (>= 3.1.0)
        Rcpp: R (>= 3.0.0)
      Rcpp: R (>= 3.0.0)
      stringr: R (>= 3.1)
        glue: R (>= 3.1)
        stringi: R (>= 2.14)
    scales: R (>= 2.13)
      RColorBrewer: R (>= 2.0.0)
      dichromat: R (>= 2.10)
      plyr: R (>= 3.1.0)
        Rcpp: R (>= 3.0.0)
      Rcpp: R (>= 3.0.0)
      R6: R (>= 3.0)
      viridisLite: R (>= 2.10)
    tibble: R (>= 3.1.0)
      cli: R (>= 2.10)
      rlang: R (>= 3.1.0)
    lazyeval: R (>= 3.1.0)
  XML: R (>= 2.13.0)
  lubridate: R (>= 3.0.0)
    stringr: R (>= 3.1)
      glue: R (>= 3.1)
      stringi: R (>= 2.14)
    Rcpp: R (>= 3.0.0)
  spacyr: R (>= 3.0.0)
    data.table: R (>= 3.0.0)
    reticulate: R (>= 3.0)
      jsonlite: 
      Rcpp: R (>= 3.0.0)
  stopwords: R (>= 2.10)
    ISOcodes: R (>= 2.10.0)

Since we never used wordcloud for text processing, we decided to write our own word cloud function, incorporating the publicly available code, and drop wordcloud (and hence slam) from our dependencies. Since this change in v1.1.0, quanteda and other packages that depend on it (e.g. tidytext) can be installed with only R 3.1.0, which was released over three years ago. I hope this change makes quantitative text analysis more accessible to R users, especially those in industry.

PhD thesis is now archived

Standard

My PhD thesis, titled Measuring bias in international news: a large-scale analysis of news agency coverage of the Ukraine crisis, has been archived electronically in the LSE Library and is now publicly available.

This thesis is a compilation of research papers, three of which have already been published, but its overall conclusion is more than a summary of the findings in those papers. I argue that, although many still strongly believe that western and non-western media report key geopolitical events very differently, in line with their home countries’ strategic interests, the difference between western and non-western media is becoming less clear, and media outlets’ alignment with their governments is weakening, due to their global operations and non-western governments’ propaganda efforts. This trend was clearest when Reuters replicated and spread the Russian government’s narratives about Ukraine during the crisis in 2014.

Release of Quanteda version 1.0

Standard

We announced the release of quanteda version 1.0 at the London R meeting on Tuesday. I thank all the organizers and 150+ participants. In the talk, instead of presenting a performance comparison with other R and Python packages, I compared quanteda's performance with its earlier CRAN versions, to show how the package has evolved into the best-performing text analysis package in R.

In this historical benchmarking, I measured the time (in seconds) that the earlier versions take to complete three basic operations:

  • Tokenization of 6,000 newspaper articles (‘tokens’)
  • Removal of English stopwords from the tokenized texts (‘remove’)
  • Construction of document-feature matrix (‘dfm’)

The earliest versions of quanteda were fast simply because they had only limited functionality. Tokenization and document-feature matrix construction became considerably slower from v0.8.2, as more functions, such as Unicode support, were implemented. There was almost no change in speed until v0.9.8.5, but token selection and document-feature matrix construction became dramatically faster in the next release. This is exactly when we introduced the upfront tokens serialization design. It only speeds up operations after tokenization, but execution time was halved for token selection and cut to one seventh for document-feature matrix construction!

historical benchmarking

After improving the performance, we worked hard on the consistency of the API and the stability of the C++ code. Apart from a regression in token selection in the version before v0.99.9, the package has remained very fast. Tokenization speed fluctuated during the gradual optimization for the new design, but it too has been among the fastest since v0.8.2.

A new paper on Russian media’s coverage of protests in Ukraine

Standard

A paper, ‘Russian Spring’ or ‘Spring Betrayal’? The Media as a Mirror of Putin’s Evolving Strategy in Ukraine, which I co-authored with Tomila Lankina as part of the British Academy-funded project, has appeared in Europe-Asia Studies.

We analyse Russian state media’s framing of the Euromaidan protests using a novel Russian-language electronic content-analysis dictionary and method that we have developed ourselves. We find that around the time of Crimea’s annexation, the Kremlin-controlled media projected media narratives of protests as chaos and disorder, using legalistic jargon about the status of ethnic Russians and federalisation, only to abandon this strategy by the end of April 2014. The shift in media narratives corresponding to the outbreak of violence in the Donbas region gives credence to arguments about Putin’s strategic, interests-driven foreign policy, while adding nuance to those that highlight the role of norms and values.

I have also made the longitudinal content analysis technique more accessible by updating the LSS package.

Historical analysis of NYT using web API

Standard

We usually use commercial databases such as Nexis to download past news stories, but you should use the New York Times APIs if you want to do historical analysis of news content. You can search NYT news articles as far back as 1851 through the API, and it is free for anyone! You can only download metadata, including summary texts (lead paragraphs), but you can still do a lot of content analysis with it.

You have to collect a lot of items when each text is short. It should not be difficult to do so through the API if you use the rtimes package. However, it is actually not as easy as it sounds, because web APIs sometimes do not respond, and you can only call this API 1,000 times a day. Therefore, the downloader has to be robust against unstable connections and able to resume downloading the next day.

After several attempts, I managed to run the download without unexpected errors. Using the code below, you can download summaries of NYT articles that contain ‘diplomacy’ or ‘military’ in their main texts between 1851 and 2000. The program saves the downloaded data to RDS files year by year, so that you do not lose anything even if you have to restart R. Do not forget to replace xxxxxxxxxxxxxxxxxxxxxxxxxxxx with your own API key.


#install.packages("rtimes")
rm(list = ls())
require(rtimes)
require(plyr)
httr::config(timeout = 120)

query <- '(body:"diplomacy" OR body:"military")'
field <- c("_id", "page", "snippet", "word_count", "score", "headline.main",
           "headline.print_headline", "byline.original", "web_url")

fetch <- function(query, year, page) {
    res <- as_search(q = NULL, fq = query,
                     begin_date = paste0(year, "0101"),
                     end_date = paste0(year, "1231"),
                     key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx',
                     page = page,
                     fl = c('_id', 'pub_date', 'word_count', 'snippet',
                            'headline', 'section_name', 'byline', 'web_url'))
    return(res)
}

for (year in seq(1851, 2000)) {
    if (file.exists(paste0('API/temp/', year, '.RDS'))) {
        cat('Skip', year, "\n")
        next
    }
    cat('Search', year, "\n")
    data <- data.frame()
    res <- NULL
    page <- 1
    while (is.null(res) || res$meta$hits > 10 * page) {
        res <- NULL
        attempt <- 0
        # retry up to five times when the API fails to respond
        while (is.null(res) && attempt <= 5) {
            attempt <- attempt + 1
            try(res <- fetch(query, year, page))
            if (is.null(res)) {
                cat('Error', attempt, '\n')
                Sys.sleep(30)
            }
            if (attempt > 5) {
                stop('Aborted\n')
            }
        }
        if (nrow(res$data) == 0) {
            cat('No data\n')
            break
        }
        res$data$page <- page
        data <- rbind.fill(data, res$data)
        cat(10 * page, 'of', res$meta$hits, "\n")
        Sys.sleep(5)
        page <- page + 1
    }
    if (nrow(data) > 0) {
        data$year <- year
        saveRDS(data, file = paste0('API/temp/', year, '.RDS'))
    }
    Sys.sleep(5)
}

What is the best SVD engine for LSA in R?

Standard

I use latent semantic analysis (LSA) to extract synonyms from a large corpus of news articles. I was very happy with Gensim’s LSA function, but I was not sure how to do LSA in R as well as in Python. There is an R package called lsa, but it is unsuitable for large matrices, because its underlying function svd() calculates all the singular values. Since I usually split documents into sentences for this task, my document-feature matrix is very large and extremely sparse.

It is easy to write an LSA function myself, but the question is which is the best SVD engine in R for this application: rsvd, irlba or RSpectra? The authors of each package claim that theirs is the fastest, but the answer seems to depend on the size of the matrix to decompose and the number of singular values requested. rsvd seems very fast with small matrices, but it used more than 20 GB of RAM on my Linux machine for a matrix created from only 1,000 news articles, while irlba and RSpectra require much less memory.

I compared irlba and RSpectra in terms of speed and accuracy using corpora of different sizes. The original corpus is comprised of 300K full-text New York Times news stories on politics. For this benchmarking, I randomly sampled news stories to construct sub-corpora and removed function words using quanteda. The arguments of the functions are set in the following way:

# irlba
S <- irlba::irlba(x, nv = 300, center = Matrix::colMeans(x), verbose = FALSE, right_only = TRUE, tol = 1e-5)

# RSpectra
S <- RSpectra::svds(x, k = 300, nu = 0, nv = 300, opts = list(tol = 1e-5))

It is straightforward to measure the speed of the SVD engines: repeatedly create sub-corpora of between 1K and 10K documents, and record the execution time. The result shows that RSpectra is roughly 5 times faster than irlba regardless of the size of the corpus.
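The timing loop can be sketched as follows; this is a minimal illustration that uses base R's svd() on small random matrices as a stand-in for the real sub-corpora and the two SVD engines:

```r
# Record execution time for decompositions of increasingly large matrices.
# The sizes and random matrices are stand-ins for the real sub-corpora.
set.seed(123)
sizes <- c(100, 200, 400)
times <- sapply(sizes, function(n) {
    x <- matrix(rnorm(n * 50), nrow = n, ncol = 50) # dummy document-feature matrix
    system.time(svd(x, nu = 0, nv = 10))["elapsed"]
})
times # elapsed seconds per matrix size
```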

It is more difficult to gauge the quality of the SVD, but I did so by calculating the cosine similarity of words to an English verb and counting its word stems among the top 100 words. For example, when the words most similar to ‘ask’ are extracted based on cosine similarity, I expect to find its inflected forms, such as ‘asked’, ‘asks’ and ‘asking’, in the top 100 if the decomposition is accurate. I cannot tell how many inflected forms should be extracted, but a larger number for the same word suggests higher accuracy. I used 25 common English verbs and calculated the average number of such words:

word <- c('want', 'use', 'work', 'call', 'try', 'ask', 'need', 'seem', 
          'help', 'play', 'move', 'live', 'believe', 'happen', 'include', 
          'continue', 'change', 'watch', 'follow', 'stop', 'create', 'open', 
          'walk', 'offer', 'remember')
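The accuracy measure can be sketched as follows. This assumes a word-by-dimension matrix of the kind produced by the SVD, with rows named by words; the helper cosine_rank() is hypothetical, not part of any package, and a tiny random matrix stands in for the real embedding:

```r
# Rank all words by cosine similarity to a target word and return the top n;
# inflected forms of the target are then counted among these top hits.
cosine_rank <- function(emb, target, n = 100) {
    norms <- sqrt(rowSums(emb ^ 2))
    sim <- (emb %*% emb[target, ]) / (norms * norms[target])
    head(names(sort(sim[, 1], decreasing = TRUE)), n)
}

set.seed(1)
emb <- matrix(rnorm(5 * 3), nrow = 5,
              dimnames = list(c("ask", "asked", "asks", "walk", "open"), NULL))
top <- cosine_rank(emb, "ask", n = 5)
top[1] # the target itself always ranks first (cosine similarity of 1)
```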

The differences between RSpectra and irlba are not large, but the former still outperformed the latter at all the corpus sizes. It is surprising that RSpectra did not compromise accuracy for speed. Interestingly, the curves for both packages become flat on the right-hand side, suggesting that there is no need to construct a corpus larger than 8K documents (~400K sentences) for synonym extraction tasks.

My conclusion from this benchmarking is that RSpectra is the best engine for LSA applications in R. Nonetheless, since irlba is being actively developed to improve its performance, we should keep an eye on that package too.