Newsmap is back


International Newsmap has been offline for a while due to a restriction imposed by my server hosting company. The number of news stories in Newsmap had been increasing since 2011, and the company eventually decided to disable the database. I left it offline, lacking the motivation to restore the system, but an email from a student in an international journalism course in Portugal encouraged me to do the job.

I moved the website to a new, more powerful server and upgraded its user interface as well as the system behind it. The media sources are reduced to 11 newspapers and satellite news channels that I find interesting. The geographical classifier is updated to the latest version, and its training is now based solely on stories collected from Yahoo News.

In the new user interface, there is no button to switch between daily, monthly and yearly views, but if you click the + signs next to the dates on the right-hand side, you can see monthly or daily figures.

Analysis of Russian media


Applying the techniques developed for English-language texts to other languages is not easy, but I managed to adapt my LSS system to Russian for a project on Russian media framing of street protests. In the project, I am responsible for the collection and analysis of Russian-language news gathered from state-controlled media between 2011 and 2014. The dictionary that I created measures a dimension of protest framing, from freedom of expression to social disorder, as well as human coders do. The details of the dictionary construction procedure are available in one of my posts. I will keep posting to the project blog.

Countries with state-owned news agencies


It is little recognized, even among students of mass media, that the international news system is a network of national and regional news agencies, and that many of them are state-owned. Fully commercial agencies like Reuters are very rare, and even international news agencies, such as AFP, are often subsidized by governments. In order to get a broad picture of state ownership of news agencies, I collected information from the BBC's media profiles and identified countries that have state-run news agencies. It turned out that 40.3% of the 114 countries in the source have state-run agencies.

In this plot, red-colored countries have state-run news agencies, and we notice that they usually have neither very small nor very large economies measured by GDP: large domestic media markets allow news agencies to become independent through commercial operation, while small economies simply cannot support national news agencies.

More important is the concentration of state-owned agencies in countries with limited press freedom: a press freedom score below 40 points is considered 'not free' by Freedom House. This means that reports from state-run news agencies in unfree countries may be biased in favor of the government, and those stories feed into the international news distribution system. Are the foreign news stories we read free from such biases?


ITAR-TASS’s coverage of annexation of Crimea


My main research interest is the estimation of media bias using text analysis techniques. I did a very crude analysis of ITAR-TASS's coverage of the Ukraine crisis two years ago, but it is time to redo everything with more sophisticated tools. I created positive-negative dictionaries for democracy and sovereignty, and applied them to see how the Russian news agency covered events related to the annexation of Crimea.

In both charts, TASS's (red) coverage shifts positively during the period between the announcement of the annexation (K1) and the referendum (K3). The change is visible not only in absolute terms, but also relative to Interfax's (blue).


The positive shift is due to TASS's positive coverage of two key events. Around K2, when the question of the referendum was changed from independence from Ukraine to annexation to Russia, the mean score over a ±3-day window shows that TASS's stories on Crimean sovereignty were markedly positive (11.7 points higher than Interfax; p < 0.01). The second high point is, of course, the day of the referendum (K3), when more than 95% of Crimeans allegedly voted for annexation. Over that seven-day period, the state of democracy in Crimea becomes very good in TASS's news stories (4.09 points higher than Interfax; p < 0.02). Why can I compare TASS with Interfax? Because their framing of Ukraine excluding Crimea (bold) is more or less the same during the same period, so a difference found only in Crimea must be due to the difference in their status, i.e. TASS is state-owned while Interfax is commercial, and to the Kremlin's interest in Crimea.
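The comparison behind these numbers can be sketched in R. The scores below are simulated, not values from the actual corpora; `tass` and `interfax` stand for per-story dictionary scores within a window around an event, and the means and spreads are pure assumptions for illustration.

```r
# Simulated illustration of the TASS vs Interfax comparison: the vectors
# stand for per-story dictionary scores in a +-3 day window around an event.
# The means and standard deviations are assumptions, not corpus values.
set.seed(123)
tass <- rnorm(50, mean = 12, sd = 10)     # hypothetical TASS scores
interfax <- rnorm(50, mean = 0, sd = 10)  # hypothetical Interfax scores
result <- t.test(tass, interfax)
mean_diff <- unname(result$estimate[1] - result$estimate[2]) # difference in means
mean_diff
result$p.value
```

A two-sample t-test like this is a standard way to check whether the difference in mean scores between two outlets in the same window is statistically significant.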

Russia’s foreign policy priority


Methodological papers are tasteless and boring without nice examples. As an exemplary application of my Newsmap, I downloaded all the news stories published by the ITAR-TASS news agency from 2009 to 2014, in both English and Russian.

From a public diplomacy point of view, I was interested in which countries receive the most coverage in the Russian official news agency's English service. In my analysis, over 660,000 stories were classified according to their geographical coverage, and the volume of English news stories was compared with the Russian counterpart for each country to produce English-Russian ratios.

The result of the analysis was striking: the ratios clearly reflect Russia's foreign policy priorities over the last few years. Significantly high ratios were found for Ukraine, Georgia, Kazakhstan, Kyrgyzstan, and Belarus. If those countries are plotted by year, it is even more interesting. Ukraine (UA) is high in 2008 and 2014, corresponding to the gas price dispute and the revolution; Georgia (GE) is important after the war but falls rapidly; Kyrgyzstan's (KG) ratio sharply increases in 2010 because of the revolution; Kazakhstan (KZ) and Belarus (BY) show similar patterns in 2009-2014, with peaks in 2011, because they increased economic partnership with Russia that year, which resulted in the Single Economic Space.
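The ratio computation itself is simple. A minimal sketch with made-up story counts (the real counts come from the Newsmap classification of the TASS corpora, and the country codes follow the plot above):

```r
# Made-up counts of geographically classified stories per country; the real
# numbers come from the Newsmap classification of over 660,000 TASS stories.
counts <- data.frame(
  country = c('UA', 'GE', 'KG', 'KZ', 'BY', 'US'),
  english = c(5200, 1800, 900, 2100, 1700, 3000),
  russian = c(4000, 1500, 700, 2000, 1600, 9000)
)
counts$ratio <- counts$english / counts$russian # English-Russian ratio
counts[order(counts$ratio, decreasing = TRUE), ] # countries by relative English coverage
```

A ratio above 1 means a country receives proportionally more attention in the English service than in the domestic Russian service, which is what makes it a plausible indicator of public diplomacy priorities.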

Sentence segmentation


I believe that the sentence is the optimal unit for sentiment analysis, but splitting whole news articles into sentences is often tricky because news contains a lot of quotations. If we simply chop up texts on punctuation, quoted passages get split across different sentences. This code is meant to avoid such problems as much as possible. It was originally written for Russian-language texts but should now work with English as well.

library(stringi)

# Input has to be a data frame with 'tid' and 'body' variables
unitize <- function(df_items, len_min=10, quote='"'){
  df_units <- data.frame()
  for(i in 1:nrow(df_items)){
    body <- insertSeparator(df_items$body[i], len_min, quote)
    units <- unlist(strsplit(body, '|', fixed=TRUE))
    flags <- unlist(lapply(units, function(x) grepl('[a-zA-Z0-9]', x))) # Language dependent
    units <- units[flags]
    len <- length(units)
    units <- stri_replace_all_fixed(units, '|', ' ') # Remove separator
    units <- stri_replace_all_regex(units, '\\s\\s+', ' ') # Remove duplicated spaces
    units <- stri_trim_both(units)
    df_temp <- data.frame(tid=rep(df_items$tid[i], len), uid=1:len, text=units, stringsAsFactors=FALSE)
    df_units <- rbind(df_units, df_temp)
  }
  write.table(df_units, file='item_units.csv', sep="\t", quote=TRUE, qmethod="double")
}

insertSeparator <- function(text, len_min=10, quote='"'){
  flag_quote <- FALSE
  flag_bracket <- FALSE
  text <- stri_replace_all_regex(text, '([^.!?]) \\| ', '$1 ') # Remove wrong paragraph separator
  tokens <- stri_split_fixed(text, ' ', simplify=TRUE)
  tokens2 <- c()
  len <- 0
  for(token in tokens){
    # Reset flags at a paragraph separator
    if(stri_detect_fixed(token, '|')){
      flag_quote <- FALSE
      flag_bracket <- FALSE
    }
    # Set flags; one-word quotations and brackets are excluded
    flag_quote <- xor(flag_quote, stri_count_fixed(token, quote) == 1)
    if(stri_detect_fixed(token, '(') != stri_detect_fixed(token, ')')){
      if(stri_detect_fixed(token, '(')) flag_bracket <- TRUE
      if(stri_detect_fixed(token, ')')) flag_bracket <- FALSE
    }
    if(len < len_min){
      # Do not split units shorter than len_min; drop paragraph separators inside them
      if(!stri_detect_fixed(token, '|')){
        tokens2 <- c(tokens2, token)
        len <- len + 1
      }
    }else{
      if(stri_detect_fixed(token, '|')){
        tokens2 <- c(tokens2, token)
        len <- 0
      }else if(!flag_quote && !flag_bracket && stri_detect_regex(token, '[.!?]$')){
        tokens2 <- c(tokens2, token, '|') # Insert split mark
        len <- 0
      }else{
        tokens2 <- c(tokens2, token)
        len <- len + 1
      }
    }
    #cat(token, flag_quote, flag_bracket, len, "\n")
  }
  text2 <- paste(tokens2, collapse=' ')
  return(text2)
}

Nexis news importer updated


I posted the code of the Nexis importer last year, but it turned out that the HTML format of the database service is less consistent than I thought, so I changed the logic. The new version depends less on the structure of the HTML files and more on the format of the content.

library(XML) # might need libxml2-dev via apt-get command

readNewsDir <- function(dir, ...){
  names <- list.files(dir, full.names = TRUE, recursive = TRUE)
  df <- data.frame()
  for(name in names){
    if(grepl('\\.html$|\\.htm$|\\.xhtml$', name, ignore.case = TRUE)){
      df <- rbind(df, readNexisHTML(name, ...))
    }
  }
  return(df)
}

#readNexisHTML('/home/kohei/Documents/Syria report/nexis.html')
readNexisHTML <- function(name, sep = ' '){
  heads <- c()
  bodies <- c()
  bys <- c()
  pubs <- c()
  datetimes <- c()
  editions <- c()
  lengths <- c()

  cat('Reading', name, '\n')
  # HTML cleaning------------------------------------------------
  lines <- scan(name, what = "character", sep = '\n', quiet = TRUE, encoding = "UTF-8")
  for(i in 1:length(lines)){
    lines[i] <- gsub('<!-- ', '', lines[i]) # Remove comment markers that hide the content
    lines[i] <- gsub(' -->', '', lines[i])
  }
  lines[i + 1] <- '' # Fix EOF problem
  html <- paste(lines, collapse = '\n')
  # Write to debug
  #cat(html, file="converted.html", sep="", append=FALSE)
  # Main process------------------------------------------------
  # Load as DOM object
  doc <- htmlParse(html, encoding = "UTF-8")
  # Remove index tables
  for(indexn in getNodeSet(doc, '/html/body//doc[.//table]')){
    removeNodes(indexn)
  }
  for(node in getNodeSet(doc, '/html/body//doc')){
    pub <- NA
    datetime <- NA
    head <- NA
    by <- NA
    edition <- NA
    section <- NA
    length <- NA
    body <- NA
    i <- 1
    for(div in getNodeSet(node, './/div')){
      value <- cleanNews(xmlValue(div))
      #print(paste(i, value))
      if(i == 1 && grepl('\\d+ of \\d+ DOCUMENTS', value)){
        i <- 2
      }else if(i == 2){
        #print(paste('pub', value))
        pub <- value
        i <- 3
      }else if(i == 3 && grepl('^(January|February|March|April|May|June|July|August|September|October|November|December)', value)){
        dateline <- value
        #print(paste('date', value))
        match <- regexpr(paste0('(January|February|March|April|May|June|July|August|September|October|November|December)',
                                '[, ]+([0-9]{1,2})',
                                '[, ]+([0-9]{4})',
                                '([,; ]+(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday))?',
                                '([, ]+(.+))?'), value, perl = TRUE)
        date <- c()
        for(j in 1:length(attr(match, "capture.start"))){
          from <- attr(match, "capture.start")[j]
          to <- attr(match, "capture.start")[j] + attr(match, "capture.length")[j] - 1
          date <- c(date, substr(dateline, from, to))
        }
        month <- gsub('[^a-zA-Z]', '', date[1])
        day <- gsub('[^0-9]', '', date[2])
        year <- gsub('[^0-9]', '', date[3])
        datetime <- format(strptime(paste(month, day, year, '12:00 AM'),
                                    format = '%B %d %Y %I:%M %p'), '%Y-%m-%d %H:%M:%S UTC')
        if(length(date) == 7){
          edition <- cleanNews(date[7])
        }
        i <- 4
      }else if(i == 4 && !grepl('[A-Z]+:', value)){
        head <- value # Sometimes does not exist
        i <- 8
      }else if(i >= 4 && grepl('BYLINE:', value)){
        by <- sub('BYLINE: ', '', value)
        i <- 8
      }else if(i >= 4 && grepl('SECTION:', value)){
        section <- sub('SECTION: ', '', value)
        i <- 8
      }else if(i >= 4 && grepl('LENGTH:', value)){
        length <- strsplit(value, ' ')[[1]][2]
        i <- 8
      }else if(i >= 4 && grepl('[A-Z]+:', value)){
        i <- 8 # Skip other metadata fields
      }else if(i == 8){
        paras <- c()
        for(p in getNodeSet(div, 'p')){
          paras <- c(paras, cleanNews(xmlValue(p)))
        }
        if(length(paras) > 0){
          body <- paste(paras, sep = '', collapse = sep)
        }
      }
    }
    heads <- c(heads, head)
    bodies <- c(bodies, body)
    bys <- c(bys, by)
    pubs <- c(pubs, pub)
    datetimes <- c(datetimes, datetime)
    editions <- c(editions, edition)
    lengths <- c(lengths, length)
  }
  return(data.frame(head = as.character(heads),
                    pub = as.character(pubs),
                    datetime = as.POSIXct(datetimes, tz = 'UTC'),
                    by = as.factor(bys),
                    edition = as.character(editions),
                    length = as.numeric(lengths),
                    body = as.character(bodies),
                    stringsAsFactors = FALSE))
}

cleanNews <- function(text){
  text <- gsub("\\r\\n|\\r|\\n|\\t", " ", text)
  text <- gsub("[[:cntrl:]]", " ", text, perl = TRUE)
  text <- gsub("\\s\\s+", " ", text)
  text <- gsub("^\\s+|\\s+$", "", text)
  return(text)
}

Geographical dictionary making technique


My new draft paper, Newsmap: Dictionary expansion technique for geographical classification of very short longitudinal texts, explains how to create a large geographical dictionary for text classification. Its algorithm is an updated version of the one behind International Newsmap, and it is simpler and more statistically grounded. As I argue in the paper, this technique could be used to classify not only news stories but also social media posts.
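The core idea of dictionary expansion can be illustrated with a toy example: documents matched by a small seed dictionary of country names provide class labels, and words strongly associated with a class become candidates for that country's expanded dictionary. The scoring below is a simple smoothed log-ratio of relative document frequencies, not the exact estimator in the paper, and the documents are invented.

```r
# Toy illustration of seed-dictionary expansion; the scoring is a simple
# smoothed log-ratio, not the estimator used in the Newsmap paper.
docs <- c('kyiv protest ukraine parliament', 'moscow kremlin russia economy',
          'ukraine kyiv economy', 'russia moscow parliament')
seed <- list(UA = c('ukraine', 'kyiv'), RU = c('russia', 'moscow'))
toks <- strsplit(docs, ' ')

# Label each document by the seed dictionary (NA if zero or multiple matches)
labels <- sapply(toks, function(x) {
  hit <- names(seed)[sapply(seed, function(s) any(s %in% x))]
  if (length(hit) == 1) hit else NA
})

# Association of a word with a class: log-ratio of document frequencies
score <- function(word, label) {
  inc <- mean(sapply(toks[which(labels == label)], function(x) word %in% x))
  out <- mean(sapply(toks[which(labels != label)], function(x) word %in% x))
  log((inc + 0.01) / (out + 0.01)) # smoothing avoids division by zero
}

vocab <- unique(unlist(toks))
sort(sapply(vocab, score, label = 'UA'), decreasing = TRUE)
```

In this sketch, 'protest' scores high for UA because it occurs only in UA-labeled documents, while 'parliament' and 'economy' score near zero because they occur in both classes; the same logic, at scale, separates genuinely country-specific vocabulary from general news vocabulary.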