Countries with state-owned news agencies


It is little recognized, even among students of mass media, that the international news system is a network of national or regional news agencies, and that many of them are state-owned. Fully commercial agencies like Reuters are rare, and even international news agencies such as AFP are often subsidized by the government. To obtain a broad picture of state ownership of news agencies, I collected information from the BBC's media profiles and identified countries that have state-run news agencies. It turned out that 40.3% of the 114 countries in the source have state-run agencies.

In this plot, countries with state-run news agencies are colored red, and we notice that they tend to have neither very small nor very large economies as measured by GDP: a large domestic media market allows a news agency to operate commercially and independently, while a small economy simply cannot support a national news agency.
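A plot of this kind can be sketched in a few lines of R. The data frame below is a made-up stand-in for the BBC-derived dataset, not the actual figures.

```r
# Hypothetical data: GDP in billion USD and agency ownership per country
df <- data.frame(
  country = c('A', 'B', 'C', 'D', 'E', 'F'),
  gdp = c(20, 150, 600, 2500, 9000, 17000),
  state_run = c(FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)
)
share <- mean(df$state_run) * 100 # Share of countries with state-run agencies
# Red for countries with state-run agencies, grey otherwise
plot(log10(df$gdp), rep(1, nrow(df)), ylim = c(0.8, 1.3),
     col = ifelse(df$state_run, 'red', 'grey50'), pch = 19,
     xlab = 'GDP (billion USD, log10)', ylab = '', yaxt = 'n',
     main = paste0('State-run agencies: ', share, '% of countries'))
text(log10(df$gdp), rep(1.1, nrow(df)), labels = df$country)
```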

More important is the concentration of state-owned agencies in countries with limited press freedom: a press-freedom score below 40 points is considered 'not free' by Freedom House. News reports from state-run agencies in unfree countries may well be biased in favor of the government, and those stories enter the international news distribution system. Are the foreign news stories we read free from such biases?


ITAR-TASS’s coverage of annexation of Crimea


My main research interest is the estimation of media bias using text analysis techniques. I did a very crude analysis of ITAR-TASS's coverage of the Ukraine crisis two years ago, but it is time to redo everything with more sophisticated tools. I created positive-negative dictionaries for democracy and sovereignty, and applied them to see how the Russian news agency covered events related to the annexation of Crimea.
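As a rough illustration of how dictionary scoring of this kind works, the sketch below counts positive and negative dictionary hits per sentence. The dictionary words here are illustrative stand-ins, not the actual democracy and sovereignty dictionaries used in the analysis.

```r
# Illustrative mini-dictionaries (the real ones are far larger)
dict_pos <- c('free', 'legitimate', 'peaceful')
dict_neg <- c('illegal', 'violation', 'coup')

# Score a sentence as positive hits minus negative hits
score_sentence <- function(x){
  tokens <- strsplit(tolower(x), '[^a-z0-9]+')[[1]]
  sum(tokens %in% dict_pos) - sum(tokens %in% dict_neg)
}

score_sentence('The referendum was free and peaceful')      # 2
score_sentence('Observers called the vote an illegal coup') # -2
```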

In both charts, TASS's coverage (red) shifts positively during the period between the announcement of the annexation (K1) and the referendum (K3). The change is visible not only in absolute terms, but also relative to Interfax's (blue).


The positive shift is due to TASS's positive coverage of two key events. Taking the mean score over the ±3 days around K2, when the referendum question was changed from independence from Ukraine to annexation by Russia, TASS's stories on Crimean sovereignty appear strongly positive (11.7 points higher than Interfax; p < 0.01). The second high point is, of course, the day of the referendum (K3), when more than 95% of Crimeans allegedly voted for annexation. Over that seven-day period, the state of democracy in Crimea looks very good in TASS's news stories (4.09 points higher than Interfax; p < 0.02). Why can I compare TASS with Interfax? Because their framing of Ukraine excluding Crimea (bold) is more or less the same during the same period; a difference found only in Crimea must therefore be due to the difference in their status (TASS is state-owned while Interfax is commercial) and to the Kremlin's interest in Crimea.
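A difference of this kind can be tested with a two-sample t-test on the daily scores. The sketch below uses simulated scores (the numbers are made up), not the actual TASS and Interfax data.

```r
# Simulated daily sentiment scores for the +-3 days around a key event;
# in the real analysis these come from the dictionary scoring above
set.seed(1)
tass     <- rnorm(7, mean = 10, sd = 3)
interfax <- rnorm(7, mean = -2, sd = 3)

# Welch two-sample t-test of the difference in mean scores
result <- t.test(tass, interfax)
result$p.value < 0.01 # TRUE: significant at the 1% level in this simulation
```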

Russia’s foreign policy priority


Methodological papers are tasteless and boring without good examples. For an exemplary application of my Newsmap, I downloaded all the news stories published by the ITAR-TASS news agency from 2009 to 2014 in both English and Russian.

From a public diplomacy point of view, I was interested in which countries receive the most coverage in the Russian official news agency's English service. In my analysis, over 660,000 stories were classified according to their geographic focus, and the volume of English news stories was compared with the Russian counterpart for each country to produce English-Russian ratios.
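The ratio computation can be sketched as follows; the story counts here are invented for illustration.

```r
# Hypothetical story counts per country in the two language services
counts <- data.frame(
  country = c('UA', 'GE', 'KZ', 'US', 'CN'),
  english = c(5200, 1800, 1400, 9000, 7000),
  russian = c(20000, 9000, 8000, 90000, 80000)
)
# English-Russian ratio of each country's share of the service's output
counts$ratio <- (counts$english / sum(counts$english)) /
                (counts$russian / sum(counts$russian))
counts[order(counts$ratio, decreasing = TRUE), c('country', 'ratio')]
```

Normalizing each language's counts by its own total before taking the ratio controls for the two services' very different overall sizes, so a ratio above 1 means a country is over-represented in the English service.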

The result of the analysis was striking: the ratios clearly reflect Russia's foreign policy priorities of the last few years. Significantly high ratios were found for Ukraine, Georgia, Kazakhstan, Kyrgyzstan, and Belarus. Plotting those countries by year is even more interesting. Ukraine (UA) is high in 2008 and 2014, corresponding to the gas price dispute and the revolution; Georgia (GE) is important after the war but rapidly falls away; Kyrgyzstan's (KG) ratio increases sharply in 2010 because of the revolution; Kazakhstan (KZ) and Belarus (BY) show similar patterns in 2009-2014, peaking in 2011, when they deepened their economic partnership with Russia, which resulted in the Single Economic Space.

Sentence segmentation


I believe that the sentence is the optimal unit for sentiment analysis, but splitting whole news articles into sentences is often tricky because news contains a lot of quotations. If we simply chop up texts at punctuation marks, quoted passages get split across different sentences. This code is meant to avoid such problems as much as possible. It was originally written for Russian-language texts but should now work with English as well.

library(stringi)

unitize <- function(df_items, len_min=10, quote='"'){ # Input has to be a data frame with 'tid' and 'body' variables

  df_units <- data.frame()
  for(i in 1:nrow(df_items)){
    body <- insertSeparator(df_items$body[i], len_min, quote)
    units <- unlist(strsplit(body, '|', fixed=TRUE))
    flags <- unlist(lapply(units, function(x) grepl('[a-zA-Z0-9]', x))) # Language dependent
    units <- units[flags]
    len <- length(units)
    units <- stri_replace_all_fixed(units, '|', ' ') # Remove residual separators
    units <- stri_replace_all_regex(units, '\\s\\s+', ' ') # Remove duplicated spaces
    units <- stri_trim_both(units)
    df_temp <- data.frame(tid=rep(df_items$tid[i], len), uid=1:len, text=units, stringsAsFactors=FALSE)
    df_units <- rbind(df_units, df_temp)
  }
  write.table(df_units, file='item_units.csv', sep="\t", quote=TRUE, qmethod="double")
}

insertSeparator <- function(text, len_min=10, quote='"'){
  flag_quote <- FALSE
  flag_bracket <- FALSE
  text <- stri_replace_all_regex(text, '([^.!?]) \\| ', '$1 ') # Remove wrong paragraph separator
  tokens <- stri_split_fixed(text, ' ', simplify=TRUE)
  tokens2 <- c()
  len <- 0
  for(token in tokens){
    # Reset flags at the paragraph separator
    if(stri_detect_fixed(token, '|')){
      flag_quote <- FALSE
      flag_bracket <- FALSE
    }
    # Set flags; a token containing both quote marks (one-word quotation) is ignored
    flag_quote <- xor(flag_quote, stri_count_fixed(token, quote) == 1)
    if(stri_detect_fixed(token, '(') != stri_detect_fixed(token, ')')){ # Ignore one-word brackets
      if(stri_detect_fixed(token, '(')) flag_bracket <- TRUE
      if(stri_detect_fixed(token, ')')) flag_bracket <- FALSE
    }
    if(len < len_min){
      # Segment still too short: drop separators to merge with the next segment
      if(!stri_detect_fixed(token, '|')){
        tokens2 <- c(tokens2, token)
        len <- len + 1
      }
    }else{
      if(stri_detect_fixed(token, '|')){
        tokens2 <- c(tokens2, token)
        len <- 0
      }else if(!flag_quote & !flag_bracket & stri_detect_regex(token, '([.!?])$')){
        tokens2 <- c(tokens2, token, '|') # Insert split mark
        len <- 0
      }else{
        tokens2 <- c(tokens2, token)
        len <- len + 1
      }
    }
    #cat(token, flag_quote, flag_bracket, len, "\n")
  }
  text2 <- paste(tokens2, collapse=' ')
  return(text2)
}

Nexis news importer updated


I posted the Nexis importer code last year, but it turned out that the HTML format of the database service is less consistent than I thought, so I changed the logic. The new version depends less on the structure of the HTML files and more on the format of the content.

library(XML) # May need the libxml2-dev system package (e.g. via apt-get) to install

readNewsDir <- function(dir, ...){
  names <- list.files(dir, full.names = TRUE, recursive = TRUE)
  df <- data.frame()
  for(name in names){
    if(grepl('\\.html$|\\.htm$|\\.xhtml$', name, ignore.case = TRUE)){
      df <- rbind(df, readNexisHTML(name, ...))
    }
  }
  return(df)
}
#readNexisHTML('/home/kohei/Documents/Syria report/nexis.html')
readNexisHTML <- function(name, sep = ' '){
  heads <- c()
  bodies <- c()
  bys <- c()
  pubs <- c()
  datetimes <- c()
  editions <- c()
  lengths <- c()
  cat('Reading', name, '\n')

  # HTML cleaning------------------------------------------------
  lines <- scan(name, what="character", sep='\n', quiet=TRUE, encoding = "UTF-8")
  docnum <- 0
  for(i in 1:length(lines)){
    lines[i] <- gsub('<!-- Hide XML section from browser', '', lines[i])
    if(grepl('<DOC NUMBER=1>', lines[i])) docnum <- docnum + 1
    lines[i] <- gsub('<DOC NUMBER=1>', paste0('<DOC ID="doc_id_', docnum, '">'), lines[i])
    lines[i] <- gsub('<DOCFULL> -->', '<DOCFULL>', lines[i])
    lines[i] <- gsub('</DOC> -->', '</DOC>', lines[i])
  }
  lines[length(lines) + 1] <- '' # Fix EOF problem
  html <- paste(lines, collapse='\n')
  # Write out to debug
  #cat(html, file="converted.html", sep="", append=FALSE)

  # Main process------------------------------------------------
  # Load as DOM object
  doc <- htmlParse(html, encoding="UTF-8")
  # Remove index
  for(indexn in getNodeSet(doc, '/html/body//doc[.//table]')){
    removeNodes(indexn)
  }
  for(node in getNodeSet(doc, '/html/body//doc')){
    pub <- NA
    datetime <- NA
    head <- NA
    body <- NA
    by <- NA
    edition <- NA
    section <- NA
    length <- NA
    i <- 1
    for(div in getNodeSet(node, './/div')){
      value <- cleanNews(xmlValue(div))
      #print(paste(i, value))
      if(i == 1 & grepl('\\d+ of \\d+ DOCUMENTS', value)){
        i <- 2
      }else if(i == 2){
        pub <- value
        i <- 3
      }else if(i == 3 & grepl('^(January|February|March|April|May|June|July|August|September|October|November|December)', value)){
        dateline <- value
        match <- regexpr(paste0('(January|February|March|April|May|June|July|August|September|October|November|December)',
                                '[, ]+([0-9]{1,2})',
                                '[, ]+([0-9]{4})',
                                '([,; ]+(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday))?',
                                '([, ]+(.+))?'), value, perl=TRUE)
        date <- c()
        for(j in 1:length(attr(match, "capture.start"))){
          from <- attr(match, "capture.start")[j]
          to <- attr(match, "capture.start")[j] + attr(match, "capture.length")[j] - 1
          date <- c(date, substr(dateline, from, to))
        }
        month <- gsub('[^a-zA-Z]', '', date[1])
        day <- gsub('[^0-9]', '', date[2])
        year <- gsub('[^0-9]', '', date[3])
        datetime <- format(strptime(paste(month, day, year, '12:00 AM'),
                                    format='%B %d %Y %I:%M %p'), '%Y-%m-%d %H:%M:%S UTC')
        if(length(date) == 7){
          edition <- cleanNews(date[7])
        }
        i <- 4
      }else if(i == 4 & !grepl('[A-Z]+:', value)){
        head <- value # Sometimes does not exist
        i <- 8
      }else if(i >= 4 & grepl('BYLINE:', value)){
        by <- sub('BYLINE: ', '', value)
        i <- 8
      }else if(i >= 4 & grepl('SECTION:', value)){
        section <- sub('SECTION: ', '', value)
        i <- 8
      }else if(i >= 4 & grepl('LENGTH:', value)){
        length <- strsplit(value, ' ')[[1]][2]
        i <- 8
      }else if(i >= 4 & grepl('[A-Z]+:', value)){
        i <- 8 # Skip other metadata fields
      }else if(i == 8){
        paras <- c()
        for(p in getNodeSet(div, 'p')){
          paras <- c(paras, cleanNews(xmlValue(p)))
        }
        if(length(paras) > 0){
          body <- paste(paras, sep = '', collapse=sep)
        }
      }
    }
    heads <- c(heads, head)
    bodies <- c(bodies, body)
    bys <- c(bys, by)
    pubs <- c(pubs, pub)
    datetimes <- c(datetimes, datetime)
    editions <- c(editions, edition)
    lengths <- c(lengths, length)
  }

  return(data.frame(head = as.character(heads),
                    pub = as.character(pubs),
                    datetime = as.POSIXct(datetimes, tz = 'UTC'),
                    by = as.factor(bys),
                    edition = as.character(editions),
                    length = as.numeric(lengths),
                    body = as.character(bodies),
                    stringsAsFactors = FALSE))
}

cleanNews <- function(text){
  text <- gsub("\\r\\n|\\r|\\n|\\t", " ", text)
  text <- gsub("[[:cntrl:]]", " ", text, perl = TRUE)
  text <- gsub("\\s\\s+", " ", text)
  text <- gsub("^\\s+|\\s+$", "", text)
  return(text)
}

Geographical dictionary making technique


My new draft paper, Newsmap: Dictionary expansion technique for geographical classification of very short longitudinal texts, explains how to create a large geographical dictionary for text classification. Its algorithm is an updated version of the one behind International Newsmap, and it is simpler and more statistically grounded. As I argue in the paper, this technique could be used to classify not only news stories but also social media posts.
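The general idea of seed-based dictionary expansion can be sketched as follows: documents are first labelled with a handful of seed words, and words strongly associated with each label are then promoted into the dictionary. The documents, seed words, and the crude exclusive-word scoring below are all invented for illustration; the paper's actual scoring is more statistically grounded.

```r
docs <- c('Kyiv protests continue as hryvnia falls',
          'Hryvnia weakens further in Kyiv trading',
          'Moscow markets rally as rouble gains',
          'Rouble slides in Moscow on oil prices')
seed <- list(UA = 'kyiv', RU = 'moscow') # Seed dictionary: place names

# Tokenize and label each document by its seed-word match
tokens <- strsplit(tolower(docs), '[^a-z]+')
labels <- sapply(tokens, function(x)
  names(seed)[sapply(seed, function(s) s %in% x)][1])

# Expand the dictionary with words exclusive to one label's documents
expand <- function(lab){
  setdiff(unique(unlist(tokens[labels == lab])),
          unlist(tokens[labels != lab]))
}
expand('UA') # Picks up e.g. 'hryvnia' as a Ukraine word
```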

International news coding instruction


It was already four years ago that I created my Newsmap. It is time to update the whole system: rewriting it fully in Python and developing a new classification algorithm. This is why I created a dataset of 5,000 human-coded international news stories using Prolific Academic.

Thanks to crowd-sourcing services, recruiting coders is no longer a problem, but we still have to provide a coding instruction, and it has to be very clear and simple. The coding rules for my research are event-oriented: international news stories were coded according to the location of the events or problems concerned.

Unlike traditional codebooks for content analysis, which are often long and complex, the Newsmap coding instruction is only five pages, and comes with classification codes in a separate CSV file.

Crowd-coding of international news by Prolific Academic


I recently created a sizable human-coded dataset (5,000 items) of international news using the Prolific Academic service. Prolific Academic is an Oxford-based academic alternative to the Amazon Mechanical Turk. The advantage of using this service is that researchers only have to pay for work that they approve. The potential drawback is its relatively high cost: the service requires researchers to offer 'ethical rewards' to participants, and the minimum rate is £5 per hour. Most of the participants on Prolific Academic are university students, but the same may be true of the Mechanical Turk.

One of the reasons I chose Prolific Academic over the Amazon Mechanical Turk was that classification of international news stories by Turkers may not be very accurate, since Americans are infamous for their lack of knowledge of foreign events.

The classification accuracy of the Prolific Academic participants in my project is shown below by country. Participants' locations (based on IP addresses) are concentrated in three countries, the UK, the US, and India, and the estimated accuracy (0-10) of their coding seems to support my hypothesis: Americans are not good at analyzing international news stories…

               accuracy   n percent
Austria        7.000000   1     0.3
Thailand       6.000000   4     1.3
Viet Nam       6.000000   4     1.3
United Kingdom 5.931250 160    51.8
Canada         5.666667   3     1.0
Spain          5.666667   9     2.9
Romania        5.600000   5     1.6
United States  5.192308  26     8.4
India          4.956989  93    30.1
Czech Republic 4.500000   2     0.6
Portugal       4.000000   1     0.3
Philippines    3.000000   1     0.3

The estimated accuracy of the US participants is much lower than that of their UK counterparts. The low accuracy of the Indian participants seems to be due to their limited English language skills. Despite the prerequisite that English be participants' first language, the high hourly rate, which is very close to the minimum wage in the UK, attracted a lot of less qualified people. Indians account for only 2% of the registrants to the service, but they were 30% of the participants in this project.

I was expecting participants' classification accuracy to increase as they performed more tasks, but quite the opposite was the case. Some participants did really good jobs initially, but their classification accuracy usually decreased, sometimes falling below 70%. The declining tendency in performance can be explained by participants' attempts to minimize effort.
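The decline can be seen by computing each participant's running accuracy over the sequence of tasks. The coding log below is a made-up example, not data from the project.

```r
# Hypothetical coding log for one participant, in order of completion
log_p1 <- data.frame(task = 1:10,
                     correct = c(1, 1, 1, 1, 0, 1, 0, 0, 1, 0))

# Running accuracy after each task shows the downward drift
log_p1$running <- cumsum(log_p1$correct) / log_p1$task
round(log_p1$running, 2)
```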

Those observations raise questions in crowd-sourced content analysis:

  1. Is the Amazon Mechanical Turk always the best crowd-sourcing platform?
  2. Should we offer different amounts of reward to participants according to country of residence?
  3. How can we maintain or improve performance of participants over the course of projects?