Applying LIWC dictionary to a large dataset

LIWC is a popular text analysis package developed and maintained by Pennebaker et al. The latest version of the LIWC dictionary was released in 2015. This dictionary seems more appropriate than classic dictionaries such as the General Inquirer dictionaries for the analysis of contemporary materials, because our vocabulary changes over the years.

However, LIWC did not work with a large corpus of news articles published between 2012 and 2015 (around 800MB of raw text). The error message suggests that the text file is too large for the software:

java.util.concurrent.ExecutionException: java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.liwc.LIWC2015.controller.TextAnalyzer.run(TextAnalyzer.java:109)
    at com.liwc.LIWC2015.controller.MainMenuController.onAnalyzeText(MainMenuController.java:113)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
    at javafx.fxml.FXMLLoader$MethodHandler.invoke(FXMLLoader.java:1771)
    at javafx.fxml.FXMLLoader$ControllerMethodEventHandler.handle(FXMLLoader.java:1657)

My solution to the problem was to apply the LIWC dictionary using quanteda's dictionary lookup function, which processed the data in less than one minute on my Core i7 machine. I compared the results from quanteda and LIWC on a subset of the corpus, and found the category percentages (in the columns from "function" to "you" in the tables) very close to each other:

# load the LIWC 2015 dictionary in flat format
dict <- dictionary(file = './Text analysis/LIWC/LIWC2015_English_Flat.dic')

# read the news articles (one document per line) and tokenize
corp <- corpus(readLines('./Text analysis/Corpus/guardian_sub.txt'))
toks <- tokens(corp, remove_punct = TRUE)

# count dictionary matches and convert to percentages of tokens
toks_liwc <- tokens_lookup(toks, dict)
mx_liwc <- dfm(toks_liwc) / ntoken(toks) * 100
head(mx_liwc, 20)

Document-feature matrix of: 10,000 documents, 73 features (21.8% sparse).
(showing first 20 documents and first 6 features)
        features
docs     function   pronoun     ppron          i        we        you
  text1  43.57743  6.122449 1.4405762 0.12004802 0.7202881 0.12004802
  text2  42.94872  5.769231 0.6410256 0.00000000 0.0000000 0.00000000
  text3  43.94904  6.157113 1.6985138 0.00000000 0.2123142 0.00000000
  text4  42.12963  4.783951 1.3888889 0.15432099 0.4629630 0.15432099
  text5  40.22140  5.289053 2.7060271 0.00000000 0.6150062 0.12300123
  text6  43.44473  4.755784 0.6426735 0.00000000 0.2570694 0.00000000
  text7  41.03139  4.035874 0.2242152 0.00000000 0.0000000 0.00000000
  text8  43.82716  8.847737 6.3786008 1.02880658 0.8230453 0.00000000
  text9  42.56121  4.519774 1.3182674 0.00000000 0.3766478 0.00000000
  text10 46.11111  6.888889 1.8888889 0.44444444 0.1111111 0.22222222
  text11 49.62963 12.469136 5.5555556 1.60493827 1.1111111 0.12345679
  text12 50.00000 11.121495 6.8224299 1.02803738 2.5233645 0.00000000

Note that quanteda version 0.99 has a problem in dfm_lookup() that slows down computation dramatically. If you want to use this function, install version 0.996 or later (available on GitHub).
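
If you prefer to apply the dictionary after building the document-feature matrix, dfm_lookup() offers an equivalent route (with quanteda 0.996 or later, as noted above). A minimal sketch, assuming the dict and toks objects created above:

mx <- dfm(toks)                                       # unweighted document-feature matrix
mx_liwc2 <- dfm_lookup(mx, dict) / ntoken(toks) * 100 # percentages of matched words per document
head(mx_liwc2)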

Presentation on multilingual text analysis methods at Waseda University

I gave a presentation titled 'A data-driven approach to bilingual analysis: representation of US foreign policy in Japanese and British newspapers over 30 years' at a Graduate School of Political Science seminar at Waseda University. The presentation was about how to apply the same quantitative text analysis methods to documents in different languages (English and Japanese) in a research project on American politics and foreign policy. Some of the methods presented at the seminar will be explained in more concrete terms at a workshop on quantitative analysis of Japanese texts, starting at 3pm on 22 May.

Upcoming presentation at Waseda University

I have been invited to present a new approach to comparative text analysis at a research seminar at Waseda University (Tokyo) on the 17th. My talk is titled 'Data-driven approach to bilingual text analysis: representation of US foreign policy in Japanese and British newspapers in 1985-2016'.

Kohei Watanabe will present a new approach to text analysis of historical data in a research project on media representation of US foreign policy (with Prof. Peter Trubowitz). In this project, he analyses how Japanese and British newspapers covered the US government's commitment to its most important allies over the last 30 years. Taking the Asahi Shimbun and the London Times as examples, he will demonstrate techniques for redefining word boundaries and for expanding keyword dictionaries with statistical models trained on a large news corpus. These techniques are equally applicable to Japanese and English texts, improving the overall accuracy and comparability of analytical results. The techniques to be presented are widely accessible in quanteda, a quantitative text analysis package for R, which he develops as one of its main contributors.

Redefining word boundaries by collocation analysis

Quanteda's tokenizer can segment Japanese and Chinese texts thanks to stringi, but its results are not always good, because the underlying library, ICU, recognizes only a limited number of words. For example, this Japanese text

"ニューヨークのケネディ国際空港"

can be translated as "Kennedy International Airport (ケネディ国際空港) in (の) New York (ニューヨーク)". Quanteda's tokenizer (the tokens function) segments it into pieces that are too small:

"ニュー"       "ヨーク"       "の"           "ケネディ"     "国際"         "空港"

Clearly, the first two tokens should not be separated. The standard Japanese POS tagger, MeCab, handles this correctly:

"ニューヨーク" "の"           "ケネディ"     "国際"         "空港"

However, the erroneous segmentation can be corrected by running quanteda's sequences function on a large corpus of news to identify contiguous collocations. After the correction of the word boundaries, both the first (ニューヨーク) and last (国際空港) parts are joined together:

"ニューヨーク" "の"             "ケネディ"     "国際空港"

This is exactly the same approach as is used for phrases and multi-word names in English texts. The process of word-boundary correction is a series of collocation analyses and token concatenations. The data used to discover the collocations comprises 138,108 news articles.

load('data_corpus_asahi_q10.RData')

# segment the corpus into sentences at the Japanese full stop and tokenize
toks <- tokens(corpus_segment(data_corpus_asahi_q10, what = "other", delimiter = "。"), include_docvars = TRUE)

# keep only tokens made of digits, hiragana, katakana (with the long-vowel mark) or kanji,
# leaving pads to preserve token positions
toks <- tokens_select(toks, '^[0-9ぁ-んァ-ヶー一-龠]+$', valuetype = 'regex', padding = TRUE)

# minimum frequency for a collocation to be considered
min_count <- 50

# process class of words that include 国際 and 空港
seqs_kanji <- sequences(toks, '^[一-龠]+$', valuetype = 'regex', nested = FALSE, 
                        min_count = min_count, ordered = FALSE) 
toks <- tokens_compound(toks, seqs_kanji[seqs_kanji$p < 0.01,], valuetype = 'fixed', 
                        concatenator = '', join = TRUE)

# process class of words that include ニュー and ヨーク
seqs_kana <- sequences(toks, '^[ァ-ヶー]+$', valuetype = 'regex', nested = FALSE, 
                       min_count = min_count, ordered = FALSE) 
toks <- tokens_compound(toks, seqs_kana[seqs_kana$p < 0.01,], valuetype = 'fixed', 
                        concatenator = '', join = TRUE)

# process both classes of words
seqs <- sequences(toks, '^[0-9ァ-ヶー一-龠]+$', valuetype = 'regex', nested = FALSE, 
                  min_count = min_count, ordered = FALSE)
toks <- tokens_compound(toks, seqs[seqs$p < 0.01,], valuetype = 'fixed', 
                        concatenator = '', join = TRUE)

saveRDS(toks, 'data_tokens_asahi.RDS')
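
To spot-check the result, a keyword-in-context view of the compounded tokens shows whether the example words are now joined as intended (a quick check, assuming these words occur in the corpus):

head(kwic(toks, 'ニューヨーク'))
head(kwic(toks, '国際空港'))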

Analyzing Asian texts in R on English Windows machines

R generally handles Unicode well, and we do not see garbled text as long as we use the stringi package. But there are some known bugs. The worst is probably the one that has been discussed in the online community.

On Windows, R prints character vectors properly, but not character vectors inside a data.frame:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

> txt <- "あいうえお" # Japanese text
> print(txt)
[1] "あいうえお" # good
> print(data.frame(txt))
                                       txt
1 <U+3042><U+3044><U+3046><U+3048><U+304A> # not good

While Ista Zahn's interesting post only shows the depth of the problem, there is a solution (or workaround) that you can try:

First, set the language for non-Unicode programs in Windows' Control Panel > Clock, Language, and Region > Language > Language for non-Unicode programs > Change system locale.

Second, set the locale in your R script:

> Sys.setlocale("LC_CTYPE", locale="Japanese") # set locale
[1] "Japanese_Japan.932"

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=Japanese_Japan.932 # changed          
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

Then, the Japanese text in a data.frame is printed correctly:

> txt <- "あいうえお" # Japanese text
> print(txt)
[1] "あいうえお" # good
> print(data.frame(txt))
         txt
1 あいうえお # good
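
If you need to switch back afterwards, the original locale (as reported by the first sessionInfo() above) can be restored in the same way; the return value echoes the new setting:

> Sys.setlocale("LC_CTYPE", locale = "English_United States.1252") # restore the original locale
[1] "English_United States.1252"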

R and Python text analysis packages performance comparison

Like many other people, I started text analysis in Python, because R was notoriously slow. Python looked like a perfect language for text analysis, and I did a lot of work during my PhD using gensim together with home-grown tools. I loved gensim's LSA, which quickly and consistently decomposes very large document-feature matrices.

However, I faced a memory crisis in Python when the size of the data for my projects continued to grow, reaching 500MB. When 500MB of text is tokenized in Python, it takes nearly 10GB of RAM. This problem is rooted in Python's list object, which is used to store character strings. The only solution seemed to be converting all the tokens into integers (serialization) at an early stage of text processing, but developing such a platform in Python is a huge undertaking.

After joining the quanteda team last summer, I have spent a lot of time improving its performance in a new architecture. I implemented up-front serialization (the tokens function), and wrote a number of multi-threaded C++ functions that modify the serialized data (many of the tokens_* functions). Once tokens are serialized, creating a sparse document-feature matrix (dfm) is quick and easy.
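
A rough illustration of what serialization means here (the internal layout is an implementation detail and may differ across versions): each document in a tokens object is stored as a vector of integer IDs that index into a shared table of types.

require(quanteda)
toks <- tokens(c(d1 = "quick and easy", d2 = "quick and cheap"))
str(unclass(toks))   # a list of integer vectors, one per document
attr(toks, "types")  # the shared vocabulary that the integers point into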

The performance gain from quanteda's new architecture became apparent in a head-to-head comparison with gensim: quanteda's execution time is around 50% shorter, and its peak memory consumption is 40% smaller than gensim's.


The data used for this benchmark is a corpus of 117,942 news stories published by the London Times. The operations are reading texts from disk, tokenizing, removing stopwords, and constructing a sparse document-feature matrix. Execution time and peak memory consumption are taken from 'Elapsed time' and 'Maximum resident set size' reported by GNU's time command. The package versions are 0.9.9.48 (quanteda) and 2.0.0 (gensim).

R

#!/usr/bin/Rscript
require(quanteda)
require(stringi)

cat("Read files\n")
txts <- stri_read_lines('data.txt') # using stri_read_lines because readLines is very slow

cat("Tokenize texts\n")
toks <- tokens(txts, what = "fastestword") # 'fastestword' means splitting text by spaces

cat("Remove stopwords\n")
toks <- tokens_remove(toks, stopwords('english'))

# construct a sparse document-feature matrix
mx <- dfm(toks)

cat(nfeature(mx), "unique types\n")

Python

#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import division
import os, sys, string, codecs
from gensim import corpora, models


if __name__ == "__main__":
    
    print "Read files"
    txts = []
    with codecs.open('data.txt', 'r', 'utf-8') as fp:
        for txt in fp:
            if len(txt.strip()) > 0:
                txts.append(txt.strip())
                
    # stopwords are imported from quanteda            
    stopwords = set(["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", 
                     "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", 
                     "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", 
                     "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "would", 
                     "should", "could", "ought", "i'm", "you're", "he's", "she's", "it's", "we're", "they're", "i've", "you've", "we've", 
                     "they've", "i'd", "you'd", "he'd", "she'd", "we'd", "they'd", "i'll", "you'll", "he'll", "she'll", "we'll", "they'll", 
                     "isn't", "aren't", "wasn't", "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't", 
                     "shan't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "let's", "that's", "who's", "what's", "here's", "there's", 
                     "when's", "where's", "why's", "how's", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", 
                     "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", 
                     "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", 
                     "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", 
                     "only", "own", "same", "so", "than", "too", "very"])
    
    print "Tokenize and remove stopwords"
    toks = [[tok for tok in txt.lower().split() if tok not in stopwords] for txt in txts]
    
    print "Serialize tokens"
    dic = corpora.Dictionary(toks)
    
    print "Construct a document-feature matrix"
    mx = [dic.doc2bow(tok) for tok in toks]
    
    print(dic)