R and Python text analysis packages performance comparison

Like many other people, I started text analysis in Python because R was notoriously slow. Python looked like a perfect language for text analysis, and I did a lot of work during my PhD using gensim together with home-grown tools. I loved gensim’s LSA, which quickly and consistently decomposes very large document-feature matrices.

However, I faced a memory crisis in Python when the size of the data for my projects continued to grow, reaching 500MB. Tokenizing 500MB of text in Python took nearly 10GB of RAM. This problem is deep-rooted in Python’s list object, which is used to store character strings. The only solution seemed to be converting all the tokens into integers (serialization) in the early stages of text processing, but developing such a platform in Python is a huge undertaking.
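
To make the idea of serialization concrete, here is a minimal sketch in Python 3. It is not the code I used, and the function name serialize is made up for illustration. Each distinct token is stored once in a vocabulary, and documents become compact arrays of integer IDs instead of lists of separate string objects. In CPython, every token produced by split() is a string object of its own (roughly 50 bytes or more for a short word on a 64-bit build), while an array of C ints typically needs 4 bytes per token.

```python
from array import array


def serialize(docs):
    """Map every distinct token to an integer ID and encode each document as IDs."""
    vocab = {}  # token string -> integer ID, assigned in order of first appearance
    encoded = []
    for doc in docs:
        ids = array("i")  # typed array of C ints instead of a list of str objects
        for tok in doc:
            # setdefault returns the existing ID, or stores and returns the next free one
            ids.append(vocab.setdefault(tok, len(vocab)))
        encoded.append(ids)
    return vocab, encoded


# Toy input standing in for a real corpus
docs = [txt.lower().split() for txt in [
    "the cat sat on the mat",
    "the dog sat on the log",
]]
vocab, encoded = serialize(docs)

# The list of types, ordered by ID, is all that is needed to decode the documents
types = sorted(vocab, key=vocab.get)
print(types)
print([list(ids) for ids in encoded])
```

The difficulty is not this step itself but everything after it: every later operation (selection, removal, compounding of tokens) has to be rewritten to work on the integer IDs.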

After joining the quanteda team last summer, I spent a lot of time improving its performance with a new architecture. I implemented up-front serialization (the tokens function) and wrote a number of multi-threaded functions in C++ to modify the serialized data (many of the tokens_* functions). Once tokens are serialized, creating a sparse document-feature matrix (dfm) is quick and easy.
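
The sketch below illustrates, in single-threaded Python 3, why serialized tokens make these steps cheap. It is only an illustration of the idea, not quanteda's C++ implementation, and the names remove_ids and build_dfm are hypothetical. Removing tokens becomes a comparison of integers, and a sparse document-feature matrix is just a count of IDs per document, stored here as (row, column, count) triplets.

```python
from collections import Counter


def remove_ids(encoded, drop_ids):
    """Drop unwanted token IDs from each document using integer comparison only."""
    return [[i for i in ids if i not in drop_ids] for ids in encoded]


def build_dfm(encoded):
    """Return a sparse document-feature matrix as (row, column, count) triplets."""
    triplets = []
    for row, ids in enumerate(encoded):
        # Counter gives the frequency of each ID; zero cells are never stored
        for col, count in sorted(Counter(ids).items()):
            triplets.append((row, col, count))
    return triplets


# Hypothetical serialized input: each ID is an index into `types`
types = ["the", "cat", "sat", "on", "mat", "dog", "log"]
encoded = [[0, 1, 2, 3, 0, 4], [0, 5, 2, 3, 0, 6]]

# Stopwords are looked up once and converted to IDs up front
stop_ids = {types.index(w) for w in ("the", "on")}
kept = remove_ids(encoded, stop_ids)
dfm = build_dfm(kept)
print(dfm)
```

The triplet form maps directly onto the coordinate format used by common sparse matrix libraries, so no string handling is needed after the initial serialization.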

The performance gain of quanteda’s new architecture became apparent in a head-to-head comparison with gensim. Quanteda’s execution time is around 50% shorter, and its peak memory consumption is 40% smaller, than gensim’s.


The data used for this benchmark is a corpus of 117,942 news stories published by the London Times. The operations are reading texts from disk, tokenizing, removing stopwords, and constructing a sparse document-feature matrix. Execution time and peak memory consumption are taken from ‘Elapsed time’ and ‘Maximum resident set size’ in the output of GNU’s time command. The package versions are 0.9.9.48 (quanteda) and 2.0.0 (gensim).
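
For readers who want to take similar measurements from inside a script, the following Python 3 sketch reports the same two quantities using the standard library. It is an alternative to GNU time, not what was used for the numbers above: the helper names measure and tokenize are made up for illustration, the resource module is only available on Unix-like systems, and the figures cover only the wrapped call and the process so far, whereas GNU time covers the whole run including interpreter start-up.

```python
import resource
import time


def measure(func, *args):
    """Run func and return its result, elapsed wall-clock seconds, and peak RSS."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the maximum resident set size of this process so far;
    # the unit is kilobytes on Linux and bytes on macOS
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak


def tokenize(texts):
    """Split lower-cased texts on whitespace, as in the benchmark scripts."""
    return [txt.lower().split() for txt in texts]


# Toy input standing in for the real corpus
toks, elapsed, peak = measure(tokenize, ["The cat sat on the mat"] * 1000)
print("Elapsed: %.3f s, peak RSS: %d" % (elapsed, peak))
```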

R

#!/usr/bin/Rscript
require(quanteda)
require(stringi)

cat("Read files\n")
txts <- stri_read_lines('data.txt') # using stri_read_lines because readLines is very slow

cat("Tokenize texts\n")
toks <- tokens(txts, what = "fastestword") # 'fastestword' means splitting text by spaces

cat("Remove stopwords\n")
toks <- tokens_remove(toks, stopwords('english'))

mx <- dfm(toks)

cat(nfeature(mx), "unique types\n")

Python

#!/usr/bin/python
# -*- coding: utf-8 -*-

from __future__ import division
import os, sys, string, codecs
from gensim import corpora, models


if __name__ == "__main__":
    
    print("Read files")
    txts = []
    with codecs.open('data.txt', 'r', 'utf-8') as fp:
        for txt in fp:
            if len(txt.strip()) > 0:
                txts.append(txt.strip())
                
    # stopwords are imported from quanteda            
    stopwords = set(["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", 
                     "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", 
                     "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", 
                     "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "would", 
                     "should", "could", "ought", "i'm", "you're", "he's", "she's", "it's", "we're", "they're", "i've", "you've", "we've", 
                     "they've", "i'd", "you'd", "he'd", "she'd", "we'd", "they'd", "i'll", "you'll", "he'll", "she'll", "we'll", "they'll", 
                     "isn't", "aren't", "wasn't", "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't", 
                     "shan't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "let's", "that's", "who's", "what's", "here's", "there's", 
                     "when's", "where's", "why's", "how's", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", 
                     "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", 
                     "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", 
                     "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", 
                     "only", "own", "same", "so", "than", "too", "very"])
    
    print("Tokenize and remove stopwords")
    toks = [[tok for tok in txt.lower().split() if tok not in stopwords] for txt in txts]
    
    print("Serialize tokens")
    dic = corpora.Dictionary(toks)
    
    print("Construct a document-feature matrix")
    mx = [dic.doc2bow(tok) for tok in toks]
    
    print(dic)
