Applying LIWC dictionary to a large dataset


LIWC is a popular text analysis package developed and maintained by Pennebaker and colleagues. The latest version of the LIWC dictionary was released in 2015. This dictionary seems more appropriate than classic dictionaries such as the General Inquirer for analyzing contemporary material, because our vocabulary changes over the years.

However, LIWC failed on a large corpus of news articles published between 2012 and 2015 (around 800 MB of raw text). The error suggests that the text file is too large for the software:

java.util.concurrent.ExecutionException: java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at com.liwc.LIWC2015.controller.TextAnalyzer.run(TextAnalyzer.java:109)
    at com.liwc.LIWC2015.controller.MainMenuController.onAnalyzeText(MainMenuController.java:113)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
    at javafx.fxml.FXMLLoader$MethodHandler.invoke(FXMLLoader.java:1771)
    at javafx.fxml.FXMLLoader$ControllerMethodEventHandler.handle(FXMLLoader.java:1657)

My solution to the problem was to apply the LIWC dictionary using quanteda's dictionary lookup function, which processed the data in less than one minute on my Core i7 machine. I compared the results from quanteda and LIWC on a subset of the corpus, and found the word counts (in the columns from “function” to “you” in the tables) very close to each other:

require(quanteda)
dict <- dictionary(file = './Text analysis/LIWC/LIWC2015_English_Flat.dic')  # load the LIWC 2015 dictionary
corp <- corpus(readLines('./Text analysis/Corpus/guardian_sub.txt'))  # one document per line
toks <- tokens(corp, remove_punct = TRUE)
toks_liwc <- tokens_lookup(toks, dict)  # replace tokens with dictionary keys
mx_liwc <- dfm(toks_liwc) / ntoken(toks) * 100  # counts as percentages of document tokens
head(mx_liwc, 20)

Document-feature matrix of: 10,000 documents, 73 features (21.8% sparse).
(showing first 20 documents and first 6 features)
        features
docs     function   pronoun     ppron          i        we        you
  text1  43.57743  6.122449 1.4405762 0.12004802 0.7202881 0.12004802
  text2  42.94872  5.769231 0.6410256 0.00000000 0.0000000 0.00000000
  text3  43.94904  6.157113 1.6985138 0.00000000 0.2123142 0.00000000
  text4  42.12963  4.783951 1.3888889 0.15432099 0.4629630 0.15432099
  text5  40.22140  5.289053 2.7060271 0.00000000 0.6150062 0.12300123
  text6  43.44473  4.755784 0.6426735 0.00000000 0.2570694 0.00000000
  text7  41.03139  4.035874 0.2242152 0.00000000 0.0000000 0.00000000
  text8  43.82716  8.847737 6.3786008 1.02880658 0.8230453 0.00000000
  text9  42.56121  4.519774 1.3182674 0.00000000 0.3766478 0.00000000
  text10 46.11111  6.888889 1.8888889 0.44444444 0.1111111 0.22222222
  text11 49.62963 12.469136 5.5555556 1.60493827 1.1111111 0.12345679
  text12 50.00000 11.121495 6.8224299 1.02803738 2.5233645 0.00000000
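To line the quanteda results up against LIWC's own spreadsheet output, the percentage matrix can be exported as a plain data frame. This is only a sketch: convert() is quanteda's exporter, while the output file name here is hypothetical.

```r
# Export the percentage matrix for comparison with LIWC's output
# (df_liwc and the CSV path are illustrative names, not from the original analysis)
df_liwc <- convert(mx_liwc, to = "data.frame")
write.csv(df_liwc, 'quanteda_liwc.csv', row.names = FALSE)
```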

Note that quanteda version 0.99 has a problem in dfm_lookup() that slows down computation dramatically. If you want to use this function, install version 0.996 or later (available on GitHub).
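If the fixed version is not yet on CRAN, the development version can be installed straight from GitHub, for example with devtools (a sketch; assumes the repository is quanteda/quanteda):

```r
# install.packages('devtools')  # if devtools is not already installed
devtools::install_github('quanteda/quanteda')
```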