I believe that the sentence is the optimal unit for sentiment analysis, but splitting whole news articles into sentences is often tricky because news contains many quotations. If we simply chop up texts at punctuation marks, quoted text gets split across different sentences. This code is meant to avoid such […]
Nexis news importer updated
I posted the code for the Nexis importer last year, but it turned out that the HTML format of the database service is less consistent than I thought, so I changed the logic. The new version depends less on the structure of the HTML files and more on the format of the content. library(XML) #might need […]
The Latent Semantic Scaling
I have previously posted document scaling results on this blog for different dimensions, such as political left-right and immigration positive-negative, but I did not explain the details of the technique, called Latent Semantic Scaling. LSS is a type of lexicon expansion technique based on Latent Semantic Analysis. Please have a look at […]
Human-coded test data for geographical classification
Early this year, I created a sizable human-coded test data set for my news classifier using the Prolific Academic service, and the data set is now ready for download. The data consists of 5,000 news summaries collected from RSS feeds of the New York Times, The Times (UK), The Australian, Times of India, and Daily […]