Historical analysis of NYT using web API


We usually use commercial databases such as Nexis to download past news stories, but the New York Times APIs are a better option for historical analysis of news content. Through the API, we can search NYT news articles back to 1851, and it is free for anyone! We can only download metadata, including summary texts (lead paragraphs), but that is still enough for a lot of content analysis.

You have to collect a lot of items when each text is short. This should not be difficult to do through the API if you use the rtimes package. However, it is not as easy as it sounds, because web APIs sometimes fail to respond, and we can only call the API 1000 times a day. Our downloader therefore has to be robust against unstable connections, and able to resume downloading the next day.

After several attempts, I managed to run the download without unexpected errors. Using the code below, you can download summaries of NYT articles that contain 'diplomacy' or 'military' in their main texts between 1851 and 2000. The program saves the downloaded data to RDS files year by year, so that you do not lose anything even if you have to restart R. Do not forget to replace xxxxxxxxxxxxxxxxxxxxxxxxxxxx with your own API key.

#install.packages("rtimes") rm(list=ls()) require(rtimes) require(plyr) httr::config(timeout = 120) query <- '(body:"diplomacy" OR body:"military")' field <- c("_id", "page", "snippet", "word_count", "score", "headline.main", "headline.print_headline", "byline.original", "web_url") fetch <- function(query, year, page) { res <- as_search(q = NULL, fq = query, begin_date = paste0(year, "0101"), end_date = paste0(year, '1231'), key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx', page = page, fl = c('_id', 'pub_date', 'word_count', 'snippet', 'headline', 'section_name', 'byline', 'web_url')) return(res) } for (year in seq(1851, 2000)) { if (file.exists(paste0('API/temp/', year, '.RDS'))) { cat('Skip', year, "\n") next } cat('Seach', year, "\n") data <- data.frame() res <- NULL page <- 1 while (is.null(res) || res$meta$hits > 10 * page) { res <- NULL attempt <- 0 while (is.null(res) && attempt <= 5) { attempt <- attempt + 1 try( res <- fetch(query, year, page) ) if (is.null(res)) { cat('Error', attempt,'\n') Sys.sleep(30) } if (attempt > 5) { stop('Aborted\n') } } if (nrow(res$data) == 0) { cat('No data\n') break } res$data$page <- page data <- rbind.fill (data, res$data) cat(10 * page, 'of', res$meta$hits, "\n") Sys.sleep(5) page <- page + 1 } if (nrow(data) > 0) { data$year <- year saveRDS(data, file = paste0('API/temp/', year, '.RDS')) } Sys.sleep(5) }

What is the best SVD engine for LSA in R?


I use latent semantic analysis (LSA) to extract synonyms from a large corpus of news articles. I was very happy with Gensim's LSA function, but I was not sure how to do LSA in R as well as in Python. There is an R package called lsa, but it is unsuitable for large matrices because its underlying function, svd(), calculates all the singular values. Since I usually split documents into sentences for this task, my document-feature matrix is very large and extremely sparse.

It is easy to make an LSA function myself, but the question is which SVD engine in R is the best for this application: rsvd, irlba or RSpectra? The authors of each package claim that theirs is the fastest, but the answer seems to depend on the size of the matrix to decompose and the number of singular values requested. rsvd seems very fast with small matrices, but it used more than 20GB of RAM on my Linux machine for a matrix created from only 1,000 news articles, while irlba and RSpectra require much less memory.
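Writing such a function is indeed simple. The sketch below is my own illustration, not code from any of these packages: center the document-feature matrix, decompose it, and scale the right singular vectors, which correspond to the words, by the singular values. Any engine that returns $d and $v can be plugged in (real engines such as RSpectra::svds take extra arguments, e.g. the number of singular values).

```r
# A minimal LSA sketch (illustration only): x is a document-feature
# matrix, k the number of dimensions, and svd_fun any SVD engine that
# returns $d (singular values) and $v (right singular vectors),
# e.g. base svd() for small matrices.
lsa_embed <- function(x, k, svd_fun = svd) {
    s <- svd_fun(scale(x, center = TRUE, scale = FALSE))
    # word vectors: right singular vectors scaled by the singular values
    s$v[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k, k)
}
```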

I compared irlba and RSpectra in terms of their speed and accuracy using corpora of different sizes. The original corpus comprises 300K full-text New York Times news stories on politics. For this benchmark, I randomly sampled news stories to construct sub-corpora and removed function words using quanteda. The arguments of the functions are set in the following way:

# irlba
S <- irlba::irlba(x, nv = 300, center = Matrix::colMeans(x), verbose = FALSE, right_only = TRUE, tol = 1e-5)

# RSpectra
S <- RSpectra::svds(x, k = 300, nu = 0, nv = 300, opts = list(tol = 1e-5))

It is straightforward to measure the speed of the SVD engines: repeatedly create sub-corpora of between 1K and 10K documents, and record the execution time. The result shows that RSpectra is roughly five times faster than irlba regardless of the size of the corpus.
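The timing harness itself can be as simple as wrapping each call in system.time(); the helper below is a sketch of what I mean (sample_dfm is a hypothetical function that draws a sub-corpus and builds its document-feature matrix, not part of any package):

```r
# Time one SVD engine on a given matrix; svd_fun is any decomposition
# function, e.g. base svd() for small dense matrices, or
# RSpectra::svds / irlba::irlba for large sparse ones.
time_svd <- function(x, svd_fun, ...) {
    unname(system.time(svd_fun(x, ...))["elapsed"])
}

# e.g. record times over sub-corpora of increasing size:
# sizes <- seq(1000, 10000, by = 1000)
# times <- sapply(sizes, function(n) time_svd(sample_dfm(n), RSpectra::svds, k = 300))
```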

It is more difficult to gauge the quality of SVD, but I did this by calculating the cosine similarity between words and an English verb, and counting the verb's inflected forms among the 100 most similar words. For example, when the words most similar to 'ask' are extracted based on cosine similarity, I expect to find inflected forms such as 'asked', 'asks' and 'asking' in the top 100 if the decomposition is accurate. I cannot tell how many inflected forms should be extracted, but a larger number for the same word suggests higher accuracy. I used 25 common English verbs, and calculated the average number of such forms:

word <- c('want', 'use', 'work', 'call', 'try', 'ask', 'need', 'seem', 
          'help', 'play', 'move', 'live', 'believe', 'happen', 'include', 
          'continue', 'change', 'watch', 'follow', 'stop', 'create', 'open', 
          'walk', 'offer', 'remember')
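The counting step can be sketched as follows (a hypothetical helper, not the exact code I used): rank all words by cosine similarity to the target verb and count how many of its inflected forms appear in the top n.

```r
# Count a verb's inflected forms among the n words most similar to it.
# 'emb' is a word-by-dimension matrix with the words as row names.
count_inflections <- function(emb, target, forms, n = 100) {
    v <- emb[target, ]
    # cosine similarity of every word vector to the target vector
    sims <- (emb %*% v) / (sqrt(rowSums(emb ^ 2)) * sqrt(sum(v ^ 2)))
    top <- rownames(emb)[order(sims, decreasing = TRUE)][seq_len(n)]
    sum(forms %in% top)
}
```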

The differences between RSpectra and irlba are not large, but the former still outperformed the latter at all corpus sizes. It is surprising that RSpectra did not compromise accuracy for speed. Interestingly, the curves for both packages flatten on the right-hand side, suggesting that there is no need to construct a corpus larger than 8K documents (~400K sentences) for synonym extraction tasks.

My conclusion from this benchmark is that RSpectra is the best engine for LSA in R. Nonetheless, since irlba is being actively developed to improve its performance, we should keep an eye on that package too.

Applying LIWC dictionary to a large dataset


LIWC is a popular text analysis package developed and maintained by Pennebaker et al. The latest version of the LIWC dictionary was released in 2015. This dictionary seems more appropriate than classic dictionaries such as the General Inquirer dictionaries for the analysis of contemporary materials, because our vocabulary changes over the years.

However, LIWC did not work with a large corpus of news articles published between 2012 and 2015 (around 800MB in raw text). The error indicates that the text file was too large for the software:

java.util.concurrent.ExecutionException: java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.liwc.LIWC2015.controller.TextAnalyzer.run(TextAnalyzer.java:109)
    at com.liwc.LIWC2015.controller.MainMenuController.onAnalyzeText(MainMenuController.java:113)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
    at javafx.fxml.FXMLLoader$MethodHandler.invoke(FXMLLoader.java:1771)
    at javafx.fxml.FXMLLoader$ControllerMethodEventHandler.handle(FXMLLoader.java:1657)

My solution to the problem was to apply the LIWC dictionary using quanteda's dictionary lookup function – it applied the dictionary to the data in less than one minute on my Core i7 machine. I compared the results from quanteda and LIWC on a subset of the corpus, and found the word counts (in the columns from "function" to "you" in the tables) very close to each other:

dict <- dictionary(file = './Text analysis/LIWC/LIWC2015_English_Flat.dic')
corp <- corpus(readLines('./Text analysis/Corpus/guardian_sub.txt'))
toks <- tokens(corp, remove_punct = TRUE)
toks_liwc <- tokens_lookup(toks, dict)
mx_liwc <- dfm(toks_liwc) / ntoken(toks) * 100
head(mx_liwc, 20)

Document-feature matrix of: 10,000 documents, 73 features (21.8% sparse).
(showing first 20 documents and first 6 features)
docs     function   pronoun     ppron          i        we        you
  text1  43.57743  6.122449 1.4405762 0.12004802 0.7202881 0.12004802
  text2  42.94872  5.769231 0.6410256 0.00000000 0.0000000 0.00000000
  text3  43.94904  6.157113 1.6985138 0.00000000 0.2123142 0.00000000
  text4  42.12963  4.783951 1.3888889 0.15432099 0.4629630 0.15432099
  text5  40.22140  5.289053 2.7060271 0.00000000 0.6150062 0.12300123
  text6  43.44473  4.755784 0.6426735 0.00000000 0.2570694 0.00000000
  text7  41.03139  4.035874 0.2242152 0.00000000 0.0000000 0.00000000
  text8  43.82716  8.847737 6.3786008 1.02880658 0.8230453 0.00000000
  text9  42.56121  4.519774 1.3182674 0.00000000 0.3766478 0.00000000
  text10 46.11111  6.888889 1.8888889 0.44444444 0.1111111 0.22222222
  text11 49.62963 12.469136 5.5555556 1.60493827 1.1111111 0.12345679
  text12 50.00000 11.121495 6.8224299 1.02803738 2.5233645 0.00000000

Note that quanteda version 0.99 has a problem in dfm_lookup() that slows down computation dramatically. If you want to use this function, install version 0.996 or later (available on GitHub).




Upcoming presentation at Waseda University


I have been invited to present a new approach to comparative text analysis at a research seminar at Waseda University (Tokyo) on the 17th. My talk is titled Data-driven approach to bilingual text analysis: representation of US foreign policy in Japanese and British newspapers in 1985-2016.

Kohei Watanabe will present a new approach to text analysis of historical data in a research project on media representation of US foreign policy (with Prof. Peter Trubowitz). In this project, he analyses how Japanese and British newspapers covered the US government's commitment to its most important allies over the last 30 years. Taking the Asahi Shimbun and the London Times as examples, he will demonstrate techniques for redefining word boundaries and for expanding keyword dictionaries with statistical models trained on a large news corpus. These techniques are equally applicable to Japanese and English texts, improving the overall accuracy and comparability of analytical results. The techniques to be presented are widely accessible in quanteda, a quantitative text analysis package for R of which he is one of the main contributors.

Redefining word boundaries by collocation analysis


Quanteda's tokenizer can segment Japanese and Chinese texts thanks to stringi, but the results are not always good, because the underlying library, ICU, recognizes only a limited number of words. For example, the Japanese phrase ニューヨークのケネディ国際空港 can be translated as "Kennedy International Airport (ケネディ国際空港) in (の) New York (ニューヨーク)". Quanteda's tokenizer (the tokens function) segments it into pieces that are too small:

"ニュー"       "ヨーク"       "の"           "ケネディ"     "国際"         "空港"

Apparently, the first two tokens should not be separated. The standard Japanese POS tagger, MeCab, gets this right:

"ニューヨーク" "の"           "ケネディ"     "国際"         "空港"

However, the erroneous segmentation can be corrected by running quanteda's sequences function on a large corpus of news to identify contiguous collocations. After the correction of the word boundaries, both the first (ニューヨーク) and last (国際空港) parts are joined together:

"ニューヨーク" "の"             "ケネディ"     "国際空港"

This is exactly the same approach as the one used for phrases and multi-word names in English texts. The word boundary correction is a series of collocation analyses and token concatenations. The data used to discover the collocations comprises 138,108 news articles.

toks <- tokens(corpus_segment(data_corpus_asahi_q10, what = "other", delimiter = "。"), include_docvars = TRUE)

toks <- tokens_select(toks, '^[0-9ぁ-んァ-ヶー一-龠]+$', valuetype = 'regex', padding = TRUE)

min_count <- 50

# process class of words that include 国際 and 空港
seqs_kanji <- sequences(toks, '^[一-龠]+$', valuetype = 'regex', nested = FALSE, 
                        min_count = min_count, ordered = FALSE) 
toks <- tokens_compound(toks, seqs_kanji[seqs_kanji$p < 0.01,], valuetype = 'fixed', 
                        concatenator = '', join = TRUE)

# process class of words that include ニュー and ヨーク
seqs_kana <- sequences(toks, '^[ァ-ヶー]+$', valuetype = 'regex', nested = FALSE, 
                       min_count = min_count, ordered = FALSE) 
toks <- tokens_compound(toks, seqs_kana[seqs_kana$p < 0.01,], valuetype = 'fixed', 
                        concatenator = '', join = TRUE)

# process both classes of words
seqs <- sequences(toks, '^[0-9ァ-ヶー一-龠]+$', valuetype = 'regex', nested = FALSE, 
                  min_count = min_count, ordered = FALSE)
toks <- tokens_compound(toks, seqs[seqs$p < 0.01,], valuetype = 'fixed', 
                        concatenator = '', join = TRUE)

saveRDS(toks, 'data_tokens_asahi.RDS')