At the LSE Computational Social Science hackathon, I presented how to develop text analysis models using **quanteda**'s core APIs, such as `as.tokens()`, `as.dfm()` and `pattern2id()`. All the slides and files are available in my GitHub repository.

# Month: April 2018

# Analyze big data with small RAM

A lot of people use **quanteda** to analyze social media posts because it is very fast and flexible, but they sometimes face dramatic slowdowns due to memory swapping caused by insufficient RAM. **quanteda** requires RAM about 5 times larger than the data to analyze, and up to 10 times larger when the data comprises many short documents. For example, in the first block of code, the original texts (`txt`) take only 286 MB in memory, but the tokens object (`toks`) takes 987.8 MB and the document-feature matrix (`mt`) takes 1411.8 MB (measured by `object.size()`):

```
require(quanteda)
txt <- readLines('tweets.txt') # 286 MB
length(txt) # 3000000
toks <- tokens(txt) # 987.8 MB
mt <- dfm(toks) # 1411.8 MB
```
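As a rough sketch of how such measurements can be taken, the snippet below wraps `object.size()` in a small helper. The name `size_mb()` and the synthetic `txt_demo` vector are assumptions made for illustration; they are not part of **quanteda** or of the original data.

```r
# Hypothetical helper (not part of quanteda): size of an R object in MB
size_mb <- function(x) {
  as.numeric(object.size(x)) / 1024^2
}

# Synthetic stand-in for the tweets: 100,000 distinct short documents
txt_demo <- paste("a short example post number", seq_len(100000))

cat("txt_demo:", round(size_mb(txt_demo), 1), "MB\n")
```

The same helper could be applied to `txt`, `toks` and `mt` to compare their footprints before deciding whether the intermediate tokens object is worth keeping.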

I recommend that users install as much RAM as possible in their machines, but there is a way to analyze big data with small RAM. We cannot avoid having the large document-feature matrix because this is what we need to analyze, but we can skip the tokens object, as it is only an intermediate step here.

In the second block of code, I split the data (`txt`) into chunks of 10,000 documents and pass them to `dfm()` chunk by chunk to avoid creating a large tokens object. `dfm()` still creates a tokens object internally, but its size is only around 5 MB for that number of documents. The output of `dfm()` is then appended to `mt` using `rbind()` so that I have all the documents in `mt` in the end. `gc()` asks R's garbage collector to delete unused objects to release memory.

```
index <- seq(length(txt))
batch <- split(index, ceiling(index / 10000)) # each batch has 10000 indices
for (i in batch) {
    cat("Constructing dfm from", i[1], "\n")
    if (i[1] > 1) {
        mt <- rbind(mt, dfm(txt[i]))
    } else {
        mt <- dfm(txt[i])
    }
    gc()
}
ndoc(mt) # 3000000
```

You might wonder why the size of a chunk is 10,000 and not 20,000 or 50,000. The reason is that **quanteda** performs tokenization iteratively for every 10,000 documents, so this is the largest possible size that does not trigger another loop.
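To see what the batching line does, here is a minimal sketch with scaled-down numbers: 25 documents and a chunk size of 10 instead of 3,000,000 and 10,000. The names `index_demo` and `batch_demo` are made up for this illustration.

```r
# Scaled-down version of the batching step: 25 indices, chunks of 10
index_demo <- seq(25)
batch_demo <- split(index_demo, ceiling(index_demo / 10))

# Number of documents in each batch; the last batch holds the remainder
print(lengths(batch_demo))          # 10, 10, 5

# First index of each batch, which the loop above prints as progress
print(sapply(batch_demo, `[`, 1))   # 1, 11, 21
```

Because `ceiling(index / size)` assigns the same group number to every run of `size` consecutive indices, only the final batch can be shorter than the chunk size.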

# Relaxing R version requirement

Until **quanteda** v1.1, our users needed to have R 3.4.0 installed, but we have relaxed the requirement to R 3.1.0, because people working in companies or other large organizations often do not have the latest version of R on their computers and therefore could not use our package. To quickly investigate why **quanteda** required R 3.4.0, I wrote a function called `print_depends()`:

```
require(stringi)
# Recursively print the R version required by a package and its dependencies
print_depends <- function(package, level = 0) {
    desc <- packageDescription(package)
    if ("Depends" %in% names(desc)) {
        dep <- stri_trim_both(desc$Depends)
        # extract the R version requirement, e.g. "R (>= 3.4.0)"
        r <- stri_extract_first_regex(dep, 'R \\(.*?\\)')
        if (is.na(r)) r <- ''
        # indent by the depth in the dependency tree
        cat(strrep(' ', level), package, ": " , r, "\n", sep = '')
        if ("Imports" %in% names(desc)) {
            # combine Depends and Imports, then strip version constraints
            imp <- paste0(desc$Depends, ', ', desc$Imports)
            if (r != '')
                imp <- stri_replace_first_fixed(imp, r, '')
            imp <- stri_replace_all_regex(imp, ' \\(.*?\\)', '')
            imp <- unlist(stri_split_regex(imp, ','))
            imp <- stri_trim_both(imp)
            imp <- imp[imp != '']
            for (i in imp) {
                print_depends(i, level + 1)
            }
        }
    }
}
print_depends('quanteda')
```

The output showed that **quanteda** needed R 3.4.0 only because **slam** requires it. Our package depended on **slam** because of **wordcloud**.

```
quanteda: R (>= 3.4.0)
extrafont: R (>= 2.15)
extrafontdb: R (>= 2.14)
Rttf2pt1: R (>= 2.15)
digest: R (>= 2.4.1)
Matrix: R (>= 3.0.1)
lattice: R (>= 3.0.0)
wordcloud:
RColorBrewer: R (>= 2.0.0)
slam: R (>= 3.4.0)
Rcpp: R (>= 3.0.0)
sna: R (>= 2.0.0)
network: R (>= 2.10)
ggrepel: R (>= 3.0.0)
ggplot2: R (>= 3.1)
digest: R (>= 2.4.1)
gtable: R (>= 2.14)
MASS: R (>= 3.1.0)
plyr: R (>= 3.1.0)
Rcpp: R (>= 3.0.0)
reshape2: R (>= 3.1)
plyr: R (>= 3.1.0)
Rcpp: R (>= 3.0.0)
Rcpp: R (>= 3.0.0)
stringr: R (>= 3.1)
glue: R (>= 3.1)
stringi: R (>= 2.14)
scales: R (>= 2.13)
RColorBrewer: R (>= 2.0.0)
dichromat: R (>= 2.10)
plyr: R (>= 3.1.0)
Rcpp: R (>= 3.0.0)
Rcpp: R (>= 3.0.0)
R6: R (>= 3.0)
viridisLite: R (>= 2.10)
tibble: R (>= 3.1.0)
cli: R (>= 2.10)
rlang: R (>= 3.1.0)
lazyeval: R (>= 3.1.0)
Rcpp: R (>= 3.0.0)
scales: R (>= 2.13)
RColorBrewer: R (>= 2.0.0)
dichromat: R (>= 2.10)
plyr: R (>= 3.1.0)
Rcpp: R (>= 3.0.0)
Rcpp: R (>= 3.0.0)
R6: R (>= 3.0)
viridisLite: R (>= 2.10)
RcppParallel: R (>= 3.0.2)
RSpectra: R (>= 3.0.2)
Matrix: R (>= 3.0.1)
lattice: R (>= 3.0.0)
Rcpp: R (>= 3.0.0)
stringi: R (>= 2.14)
ggplot2: R (>= 3.1)
digest: R (>= 2.4.1)
gtable: R (>= 2.14)
MASS: R (>= 3.1.0)
plyr: R (>= 3.1.0)
Rcpp: R (>= 3.0.0)
reshape2: R (>= 3.1)
plyr: R (>= 3.1.0)
Rcpp: R (>= 3.0.0)
Rcpp: R (>= 3.0.0)
stringr: R (>= 3.1)
glue: R (>= 3.1)
stringi: R (>= 2.14)
scales: R (>= 2.13)
RColorBrewer: R (>= 2.0.0)
dichromat: R (>= 2.10)
plyr: R (>= 3.1.0)
Rcpp: R (>= 3.0.0)
Rcpp: R (>= 3.0.0)
R6: R (>= 3.0)
viridisLite: R (>= 2.10)
tibble: R (>= 3.1.0)
cli: R (>= 2.10)
rlang: R (>= 3.1.0)
lazyeval: R (>= 3.1.0)
XML: R (>= 2.13.0)
lubridate: R (>= 3.0.0)
stringr: R (>= 3.1)
glue: R (>= 3.1)
stringi: R (>= 2.14)
Rcpp: R (>= 3.0.0)
spacyr: R (>= 3.0.0)
data.table: R (>= 3.0.0)
reticulate: R (>= 3.0)
jsonlite:
Rcpp: R (>= 3.0.0)
stopwords: R (>= 2.10)
ISOcodes: R (>= 2.10.0)
```
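With a listing this long, the strictest requirement is easy to miss by eye. The sketch below shows one way to find it using base R's `numeric_version()`. The `req` vector is a hand-picked subset of the output above, used only for illustration, and the variable names are my own.

```r
# Requirement strings in the form printed by print_depends();
# a hand-picked subset of the listing, for illustration only
req <- c(slam    = "R (>= 3.4.0)",
         Matrix  = "R (>= 3.0.1)",
         ggplot2 = "R (>= 3.1)",
         stringi = "R (>= 2.14)")

# Strip the "R (>= " prefix and the closing ")" to keep the version number
ver <- numeric_version(sub("^R \\(>= (.*)\\)$", "\\1", req))

# Names of the packages with the highest required R version
strictest <- names(req)[ver == max(ver)]
print(strictest)  # "slam"
```

Comparing with `numeric_version()` rather than as plain strings matters here, because a string comparison would rank "2.14" and "3.1" character by character instead of component by component.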

Since we never use **tm** for text processing, we decided to write our own word cloud function, incorporating the publicly available code, and to drop **tm** from our dependencies. Since this change in v1.1.0, **quanteda** and other packages that depend on it (e.g. **tidytext**) can be installed with R 3.1.0, which was released over three years ago. I hope this change makes quantitative text analysis more accessible to R users, especially those in industry.