Sentence segmentation

I believe that the sentence is the optimal unit of sentiment analysis, but splitting whole news articles into sentences is often tricky because news contains a lot of quotations. If we simply chop up texts based on punctuation, quoted text gets split across different sentences. This code is meant to avoid such […]
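The idea of not splitting inside quotations can be sketched roughly as follows. This is a minimal Python illustration, not the code from the post: the function name `split_sentences` and the rule for deciding when a closing quote ends a sentence are assumptions made for the example, and abbreviations such as "Mr." are not handled.

```python
# Hypothetical sketch: split on ., ! or ? but never inside a quotation.
TERMINATORS = ".!?"
OPEN_QUOTES = '"\u201c'


def split_sentences(text):
    """Return a list of sentences, keeping quoted passages in one piece."""
    sentences = []
    start = 0
    in_quote = False
    n = len(text)
    for i, ch in enumerate(text):
        closes_quote = False
        if ch == '"':                      # straight quotes toggle the state
            closes_quote = in_quote
            in_quote = not in_quote
        elif ch == "\u201c":               # curly opening quote
            in_quote = True
        elif ch == "\u201d":               # curly closing quote
            closes_quote = in_quote
            in_quote = False

        # A boundary needs whitespace (or the end of the text) right after it.
        at_gap = i + 1 == n or text[i + 1].isspace()
        cut = False
        if ch in TERMINATORS and not in_quote and at_gap:
            cut = True
        elif closes_quote and i > 0 and text[i - 1] in TERMINATORS and at_gap:
            # Assumed rule: a quote ending in punctuation closes the sentence
            # only if the next word looks like the start of a new sentence.
            nxt = text[i + 1:].lstrip()
            cut = nxt == "" or nxt[0].isupper() or nxt[0] in OPEN_QUOTES

        if cut:
            piece = text[start:i + 1].strip()
            if piece:
                sentences.append(piece)
            start = i + 1

    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

For example, `split_sentences('"Stop!" she shouted. Nobody moved.')` keeps the quotation attached to its reporting clause instead of cutting at the exclamation mark.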

Nexis news importer updated

I posted the code for the Nexis importer last year, but it turned out that the HTML format of the database service is less consistent than I thought, so I changed the logic. The new version depends less on the structure of the HTML files and more on the format of the content.

library(XML) #might need […]

Testing immigration dictionary

After making some changes in my automated dictionary creation system, I ran a test to validate the word choice for the new immigration dictionary. The latest version contains fewer intuitively negative words with positive scores than the original version. The test was performed by comparing the computer content analysis with human coding of the 2010 UK manifestos. […]

International Newsmap

I have been running a website called International Newsmap. It collects international news stories from news sites and classifies them according to their geographic focus using a Bayesian classifier and a lexicon expansion technique. The sources of news are English websites in the US, the UK, New Zealand, India, Singapore, Kenya, and South Africa. The main […]
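To give a rough sense of how a Bayesian classifier can assign a story to a country, here is a small multinomial naive Bayes sketch in Python. It is not the site's implementation: the class name `NaiveBayes`, the tokenizer, and the four training headlines are invented purely for illustration, and the lexicon expansion step is not shown.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()

    def train(self, text, label):
        self.class_docs[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        # Words never seen in training carry no information and are skipped.
        words = [w for w in tokenize(text) if w in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy headlines, used only to exercise the sketch.
model = NaiveBayes()
model.train("Nairobi police arrest suspects after market attack", "Kenya")
model.train("Kenyan shilling falls as Nairobi traders react", "Kenya")
model.train("Delhi court hears appeal over Mumbai rail project", "India")
model.train("Indian rupee steady as Mumbai stocks rise", "India")

kenya_guess = model.classify("Traders in Nairobi fear a weaker shilling")
india_guess = model.classify("Mumbai court delays rail appeal")
```

With this toy data, place names and related words such as "Nairobi" or "Mumbai" dominate the score, which is the intuition behind classifying stories by geographic focus.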

