Factor analysis in R and Python

Python has a number of statistical modules that allow us to perform analyses without R, but it is always a good idea to compare the outputs of different implementations. I performed factor analysis using Python's scikit-learn module for my dictionary creation system, but the outputs were completely different from those of R's factanal function, just like in a post on Stack Overflow. After long hours, I finally found that it was because I hadn't normalized the data for scikit-learn. factanal normalizes data automatically, but scikit-learn does not. The right way to perform factor analysis is this:

from sklearn import decomposition, preprocessing

data_normal = preprocessing.scale(data)  # Standardize 'data' (the raw observation matrix) to zero mean and unit variance
fa = decomposition.FactorAnalysis(n_components=1)
fa.fit(data_normal)
print(fa.components_)  # Factor loadings

If you do this, the factor loadings estimated by scikit-learn become very close to R's estimates:

# Python (Scikit-learn)
1: 0.24705429
2: 0.56100678
3: 0.48559474
4: 0.54208185
5: 0.50989289
6: 0.33028625
7: 0.38651951

# R (factanal)
1: 0.285719656390773
2: 0.633553717909623
3: 0.493731965398187
4: 0.527418210503982
5: 0.487150249901473
6: 0.312724093202758
7: 0.378827084637606
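
For reference, the R estimates above come from factanal, which works on the correlation matrix and therefore standardizes internally. A minimal sketch, assuming data is the same raw matrix passed to preprocessing.scale in the Python example:

# R equivalent: factanal standardizes internally, so raw data is fine
fa <- factanal(data, factors = 1)  # factors = 1 matches n_components = 1
print(fa$loadings)                 # Factor loadings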

Testing immigration dictionary

After making some changes to my automated dictionary creation system, I ran a test to validate the word choice for the new immigration dictionary. The latest version contains fewer intuitively negative words with positive scores, unlike the original version.

The test was performed by comparing the computerized content analysis with human coding of the 2010 UK manifestos. X is the automated coding by the dictionary and Y is the human coding. Green and Conservative are off the 45-degree line, but the automated coding still corresponds strongly to the human coding.

[Figure: UK 2010 — automated coding (X) plotted against human coding (Y) for each party manifesto]
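
A plot like this is easy to draw in base R. A minimal sketch, assuming hypothetical vectors automated and human holding one score per party and a character vector parties with the party names:

# Scatter of automated coding (X) against human coding (Y)
plot(automated, human, xlab = "Automated coding (X)", ylab = "Human coding (Y)")
text(automated, human, labels = parties, pos = 3)  # Label points by party
abline(0, 1, lty = 2)  # 45-degree line: perfect agreement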

Text analysis dictionary on immigration policy

Dictionary-based text analysis has a number of good properties, but it is always difficult to make a new dictionary, so text analysts often use existing dictionaries, such as the General Inquirer dictionaries, which were originally created decades ago, or their derivatives. However, I believe that it is time to create new dictionaries from scratch using the many tools and techniques now available to us.

My first original dictionary is the UK Immigration Dictionary, which is meant to measure attitudes toward immigration to the UK. The words include counter-intuitive positive entries such as 'racist', but the results are as follows when the dictionary is applied to the 2010 UK party manifestos.

BNP          -0.660772785
Coalition     0.403547905
Conservative  0.002508397
Greens       -0.898075732
Labour        0.081029432
LibDem        0.050535076
PC           -0.015306746
SNP          -0.551027977
UKIP         -0.335952325

I am not yet sure how accurate this is, but it looks interesting, since the small parties, which tend to be against immigration, all score negative.

It is very easy to use the dictionary in R with Quanteda:

library(quanteda)

options(stringsAsFactors = FALSE)
df.temp <- read.csv(file = "news.dictionary.tfidf.500.csv", header = FALSE, sep = '\t')
df.dict <- data.frame(word = as.character(df.temp$V1), score = as.numeric(df.temp$V2))

uk2010immigCorpus <- corpus(uk2010immig,
                            docvars = data.frame(party = names(uk2010immig)),
                            notes = "Immigration-related sections of 2010 UK party manifestos",
                            enc = "UTF-8")
mx <- tfidf(dfm(uk2010immigCorpus))  # Document-feature matrix with tf-idf weighting
mx2 <- as.data.frame.matrix(t(subset(t(mx), colnames(mx) %in% df.dict$word)))  # Remove columns not in the dictionary

# Make a list of scores in the same order as the columns
v.dict <- list()
for (word in colnames(mx2)) {
  v.dict[[word]] <- df.dict$score[df.dict$word == word]
  #v.dict[[word]] <- ifelse(df.dict$score[df.dict$word == word] > 0, 1, -1)
}

print(as.matrix(mx2) %*% as.matrix(unlist(v.dict)))  # One dictionary score per document
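
The final matrix product yields one score per document. To reproduce a table like the one above, the scores can be paired with the party names stored as a docvar; a minimal sketch, reusing the objects created above:

# Pair each manifesto's dictionary score with its party name
scores <- as.matrix(mx2) %*% as.matrix(unlist(v.dict))
data.frame(party = docvars(uk2010immigCorpus, "party"), score = scores[, 1])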