Computing document similarity in a large corpus


Since early this year, many people have asked me how to compute document (or feature) similarity in a large corpus. They said their function stops because of a lack of RAM:

Error in .local(x, y, ...) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92 

This happened in our textstat_simil(margin = "documents") too, because the matrix multiplication in the function produces a dense matrix with (ndoc(x) ^ 2) / 2 elements: that is 5,000,000,000 cells if your corpus has 100,000 documents!
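
A quick back-of-the-envelope calculation (8 bytes per double) shows why this exhausts memory:

ndoc <- 100000
cells <- ndoc ^ 2 / 2    # 5,000,000,000 cells
cells * 8 / 1024 ^ 3     # roughly 37 GB of RAM for a dense matrix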

A solution to this problem is not to record values that are below a certain threshold. When you study document reuse, for example, you might only be interested in documents with cosine similarity larger than 0.9. We upgraded our functions for document similarity computation, which are used in textstat_simil() and textstat_dist(), to achieve this in the latest GitHub version of quanteda. We also parallelized the computation in C++ to make it faster.

The new function is called textstat_proxy(). It is still experimental but has two new arguments, min_proxy and rank, to reduce the number of recorded values and save storage. If you set min_proxy, the function records only values larger than that threshold; if you use rank, it records only the top-n largest values for each document (or feature).
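
As a minimal sketch of how the two arguments work (assuming the development version of quanteda described here), a toy dfm makes the effect easy to see:

library(quanteda)
toy <- dfm(tokens(c(d1 = "a b c d", d2 = "a b c e", d3 = "x y z")))
textstat_proxy(toy, method = "cosine", min_proxy = 0.5)  # drop pairs with similarity below 0.5
textstat_proxy(toy, method = "cosine", rank = 2)         # keep only the 2 largest values per document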

Benchmarking on my Core i7 laptop showed that the new function is twice as fast as the old one. If either min_proxy or rank is used, it becomes four times faster.

> library(quanteda)
> mt <- readRDS("data_corpus_guardian_2016-2017.RDS") %>% 
+       dfm(remove_punct = TRUE, remove = stopwords()) %>% 
+       dfm_trim(min_termfreq = 10)
> dim(mt)
[1] 84599 83573 # 84599 documents
>
> # subset the corpus because it is too large for the old function
> mt_sub <- dfm_sample(mt, 10000) 
> dim(mt_sub)
[1] 10000 83573 # 10000 documents
> 
> quanteda_options(threads = 8)
> microbenchmark::microbenchmark(
+     old = textstat_simil_old(mt_sub, method = "cosine"),  
+     new = textstat_simil(mt_sub, method = "cosine"),  
+     new_min = textstat_proxy(mt_sub, method = "cosine", min_proxy = 0.9),
+     new_rank = textstat_proxy(mt_sub, method = "cosine", rank = 10),
+     times = 10
+ )
Unit: seconds
     expr       min        lq      mean    median        uq       max neval
      old 22.426574 22.641949 22.788590 22.745563 22.960467 23.160844    10
      new 13.376352 13.417328 13.641411 13.638641 13.699010 14.226246    10
  new_min  4.977046  5.010795  5.119516  5.114965  5.201249  5.314574    10
 new_rank  5.303440  5.322976  5.411015  5.385124  5.482439  5.583506    10

More importantly, we can compute the document similarity between all 84,599 documents in the corpus without problems if min_proxy is used. It took only 15 minutes on my laptop, and the resulting object is as small as 12 MB.

> new_min <- textstat_proxy(mt, method = "cosine", min_proxy = 0.9)
> print(object.size(new_min), units = "Mb")
12.4 Mb

If you want to know which documents are similar to which, you can write a simple conversion function and run it:

> matrix2list <- function(x) {
+   # x is a sparse matrix in triplet form: @x holds the values, @i and @j the zero-based row and column indices
+   names(x@x) <- rownames(x)[x@i + 1]
+   split(x@x, factor(x@j + 1, levels = seq(ncol(x)), labels = colnames(x)))
+ }
> simil <- matrix2list(new_min)
> head(simil[lengths(simil) > 1])
$text119559
text119554 text119559 
 0.9929825  1.0000000 

$text119561
text119553 text119561 
  0.994557   1.000000 

$text119562
text119557 text119562 
 0.9975438  1.0000000 

$text119564
text119553 text119561 text119564 
 0.9854428  0.9908825  1.0000000 

$text119568
text119555 text119568 
 0.9963637  1.0000000 

$text119570
text119551 text119570 
 0.9586148  1.0000000

textstat_proxy() has great potential, but it is still experimental because we are not sure what the best format for the resulting objects is. If you have any opinions, please post a comment on the GitHub page.

Quantitative text analysis of Japanese


To get more Japanese researchers interested in quantitative text analysis, I wrote a paper titled "Quantitative Analysis of Japanese" (『日本語の量的分析』) with Amy Catalinac of New York University. So far, we have received many positive reactions on Twitter.

This paper discusses the use of quantitative text analysis, a methodology that has gained popularity among political scientists in Europe and North America in recent years, for Japanese-language texts. We first describe the background against which quantitative text analysis emerged and explain how it is used in European and American political science. Next, so that readers can apply quantitative text analysis in their own research, we walk through the workflow in concrete terms, noting the points that require special attention when analyzing Japanese. Finally, we introduce the statistical models used in Europe and North America and show, with research examples, that they can also be applied to Japanese documents. We argue that recent technical and methodological developments have made quantitative text analysis of Japanese entirely feasible, but also note that institutional issues, such as the availability of data, must be addressed for the method to spread widely in Japanese political science.

Newsmap is available on CRAN


I am happy to announce that newsmap, our R package for semi-supervised document classification, is available on CRAN. The package is algorithmically simple but comes with well-maintained geographical seed dictionaries in English, German, Spanish, Russian, and Japanese.

The package was originally created for geographical classification of news articles, but it can also be used for other tasks such as topic classification. For example, in my latest paper on conspiracy theories, I performed two-dimensional classification using tentative topical seed words (see the sketch after the list below):

economy: [market*, money, bank*, stock*, bond*, industry, company, shop*]
politics: [parliament*, congress*, party leader*, party member*, voter*, lawmaker*, politician*]
society: [police, prison*, school*, hospital*]
diplomacy: [ambassador*, diplomat*, embassy, treaty]
military: [military, soldier*, air force, marine, navy, army]
nature: [water, wind, sand, forest, mountain, desert, animal, human]
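
The sketch below shows roughly how such a topical classification can be set up with newsmap; it is not the exact code from the paper, and corp stands for a hypothetical corpus of news articles.

library(quanteda)
library(newsmap)

# topical seed dictionary built from the list above
dict_topic <- dictionary(list(
    economy = c("market*", "money", "bank*", "stock*", "bond*", "industry", "company", "shop*"),
    politics = c("parliament*", "congress*", "party leader*", "party member*", "voter*", "lawmaker*", "politician*"),
    society = c("police", "prison*", "school*", "hospital*"),
    diplomacy = c("ambassador*", "diplomat*", "embassy", "treaty"),
    military = c("military", "soldier*", "air force", "marine", "navy", "army"),
    nature = c("water", "wind", "sand", "forest", "mountain", "desert", "animal", "human")
))

toks <- tokens(corp, remove_punct = TRUE)        # corp: a hypothetical corpus of news articles
label <- dfm(tokens_lookup(toks, dict_topic))    # seed-word matches per document
feat <- dfm_remove(dfm(toks), stopwords("en"))   # full feature matrix
model <- textmodel_newsmap(feat, label)          # semi-supervised classifier
topic <- predict(model)                          # predicted topic for each document

Geographical classification works in the same way with the seed dictionaries shipped with the package (e.g. data_dictionary_newsmap_en).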

In the mosaic plot, the width and height of the columns show the proportions of countries and topics, respectively. Since the categories are pre-defined here, the result is much easier to interpret than unsupervised topic classification by LDA.

By the way, if you want to produce the plot, just pass a cross-table of countries and topics to mosaicplot():

top <- head(sort(table(data$country), decreasing = TRUE), 20)   # 20 most frequent countries
tb <- table(data$country, data$topic)[names(top), ]              # country-by-topic cross-table
mosaicplot(tb, border = 0, col = RColorBrewer::brewer.pal(6, "Set2"), main = "")

Obstacles to Asian-language text analysis


In a presentation titled Internationalizing Text Analysis at a workshop at Waseda University on 27 June, Oul Han and I discussed what is obstructing the adoption of quantitative text analysis techniques in Japan and Korea. Our question was why so few people do quantitative analysis of Japanese and Korean texts, despite it becoming one of the mainstream methodologies in North America and Europe. To explain this, we identified four key areas: tools, data, skills, and literature.

Tools

We have seen exciting developments in text analysis tools in recent years. Support for Unicode has improved dramatically thanks to the stringi package. We have released quanteda, which enables analysis of Asian-language texts in the same way as English. Morphological analysis tools have been available in R for some time (RMeCab and RMecabKo), and RcppMeCab, which supports both Japanese and Korean, has been released recently. In terms of available tools, there is no reason not to embark on quantitative analysis of Asian-language texts.
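
As a small illustration of the point about quanteda (a sketch, not taken from the presentation): stringi's ICU word segmentation lets tokens() split Japanese text without an external morphological analyzer, so the workflow is the same as for English.

library(quanteda)
txt <- c(en = "Quantitative text analysis is feasible.",
         ja = "日本語の量的テキスト分析は十分に可能になった。")
toks <- tokens(txt, remove_punct = TRUE)   # ICU word boundaries via stringi
dfm(toks)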

Data

Official political documents are publicly available in both Japan and Korea, but unofficial political documents, such as election manifestos, are not. Further, media texts are generally more difficult to collect because of copyright protection. While Korean newspaper articles are available in KINDS and the Dow Jones Factiva database, Japanese newspaper articles are only available in the publishers' commercial databases. It will take time to improve access to textual data, but we should begin by making exhaustive lists of Japanese and Korean sources.

Skills

You need different skills at different stages of a text analysis project. Designing social scientific research that uses quantitative text analysis requires broad knowledge of the techniques and their applications. Data collection often involves access to APIs or the use of scrapers, which demands knowledge of machine-readable formats (HTML, XML, JSON) and computer programming. Quantitative text analysis is not always statistical, but you still need to know descriptive and inferential statistics (e.g. chi-squared tests, t-tests, regression analysis). These skills can be acquired through lectures and seminars, but very few text analysis courses, if any, are offered at Japanese and Korean universities. Until such courses become widely available, we need to organize workshops to train future text analysts.

Literature

The lack of a standard textbook on social scientific text analysis has been one of the biggest problems, limiting the opportunity to acquire the above-mentioned skills to people based in North America or Europe. Aiming to address this problem, I created an online textbook with Stefan Müller, but its pages are all in English. I recently added a section explaining language-specific pre-processing, but there is only one page for Japanese. We should translate the online textbook into other languages and add more pages on how to handle Asian-language texts.

If you want to know more, please see the slides.