Auditing POLTEXT 2019 in Tokyo


We have opened applications for auditing POLTEXT 2019, which will take place at Waseda University on 14-15 September. We are very excited to have world-famous keynote speakers, Jonathan Slapin (University of Zurich) and Sven-Oliver Proksch (University of Cologne), and over 60 presenters from all over the world.

If you are interested in attending, please sign up on the conference website.

False European news sites


According to a news report, the European Union is stepping up its efforts to prevent disinformation from spreading, in collaboration with fact-checking organizations in its member countries. They fear that foreign actors such as the Russian government will try to influence the EU parliament election later this month by spreading eurosceptic or anti-immigrant content.

Since 2017, I have also been developing a new methodology to detect sources of false news on the internet using a commercial web search API and quantitative text analysis. There were several technological problems that I needed to solve, but I have finally produced results that I will present at the EPSA Annual Conference 2019 in Belfast. The title of my paper is Searching for ‘submarine’ propaganda outlets: an efficient method to identify sources of false information on the web. I wrote the paper to show that we can find false news sites using these tools at low cost, but I also found many websites that disguise themselves as news media in Europe.

For example, many of the websites in the table refer to European cities and countries in their names to make their news articles look more trustworthy. Interestingly, all of these websites are hosted on a web server with the IP address 195.154.220.115. I could not find out who is behind these false European news sites, but I think they could be used for propaganda against Europeans to influence political events such as the EU parliament election.

Name Domain Place
Germany Latest News germanylatest.com Germany
France News 7 francenews7.com France
Fox News Online 24 foxnews24.info
Florida Report Daily floridareportdaily.com Florida
Extra London News extralondon.co.uk London
Extra American News extraamerican.com America
Evening Washington News eveningwashington.com Washington
Europe Brief News europebriefnews.com Europe
Edition Online editiononline.co.uk
Edition America editionamerica.com America
The Daily Cambridge dailycambridge.co.uk Cambridge
Current Affairs Online currentaffairsonline.co.uk
California Today californiatoday.net California
Brussels Morning brusselsmorning.com Brussels
Britain Today News britaintodaynews.com Britain
Britain Post News britainpost.co.uk Britain
Business News Report bnreport.com
Berlin Tomorrow berlintomorrow.com Berlin
Australia News Today australianews.today Australia
Amsterdam Times amsterdamtimes.info Amsterdam

You can list the websites hosted on the web server using a reverse IP lookup tool.
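
As an illustration of how such a lookup can be scripted from R, the snippet below queries a public reverse-IP service; the choice of service (HackerTarget's API) is only an example and is not the tool used in the paper.

# list domains hosted on the same server via a public reverse IP lookup service
ip <- "195.154.220.115"
domains <- readLines(paste0("https://api.hackertarget.com/reverseiplookup/?q=", ip))
head(domains)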

A dictionary for quantitative text analysis of Japanese


Keyword dictionaries are widely used in quantitative text analysis, but there are very few Japanese dictionaries suitable for social scientific analysis, and I think this has been an obstacle to both research and teaching. Recently, however, an acquaintance told me about the Nikkei Thesaurus, in which about 15,000 words are divided into the following 23 categories.

[1] "一般・共通"              "経済・産業"               "経営・企業"
[4] "農林水産"                "食品"                    "繊維・木材・紙パ"
[7] "資源・エネルギー"         "金属・土石"               "化学"
[10] "機械・器具・設備"        "電子電機"                 "情報・通信"
[13] "建設"                  "流通・サービス・家庭用品"   "環境・公害"
[16] "科学技術・文化"          "自然界"                  "国際"
[19] "政治"                  "地方"                    "労働・教育・医療"
[22] "社会・家庭"             "地域"

It looks usable at least for the analysis of newspaper articles, so I collected the words and compiled them in YAML format. The single-word version is just as published on the website, while the multi-word version has been segmented with quanteda's tokens(), which makes it easier to use for dictionary analysis and for compounding multi-word expressions.

The easiest way to use this thesaurus with quanteda is as follows:

library(quanteda)
# toks is a tokens object created with tokens()
dict <- dictionary(file = "nikkei-thesaurus_multiword.yml")
tokens_lookup(toks, dict)    # dictionary analysis
tokens_compound(toks, dict)  # compound multi-word expressions

For more details on how to use dictionaries, see the Quanteda Tutorials. Also, if you download articles from the Asahi Shimbun's Kikuzo or the Yomiuri Shimbun's Yomidas databases, the newspapers package makes it easy to read the texts into R.

French and Chinese seed dictionaries are added to Newsmap


newsmap is a dictionary-based semi-supervised model for geographical document classification. The core of the package is not the machine learning algorithm but the multilingual seed dictionaries created by me and other contributors in English, German, French, Spanish, Japanese, Russian and Chinese. We recently added the Chinese (traditional and simplified) and French dictionaries, and submitted the package to CRAN.
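
For readers who want to try the newly added dictionaries, here is a minimal sketch of the usual newsmap workflow. It assumes the French dictionary is exported as data_dictionary_newsmap_fr and that corp is a quanteda corpus of French-language news articles; both names are placeholders rather than code from this post.

library(quanteda)
library(newsmap)

# corp is assumed to be a corpus of French-language news articles
toks <- tokens(corp, remove_punct = TRUE)

# label documents with the seed dictionary (levels = 3 selects the country-level keys)
label_dfm <- dfm(tokens_lookup(toks, data_dictionary_newsmap_fr, levels = 3))

# document features used to extend the seed words
feat_dfm <- dfm(toks)

# fit the semi-supervised model and predict the most likely country for each document
model <- textmodel_newsmap(feat_dfm, label_dfm)
country <- predict(model)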

The number of native speakers of these languages accounts for 30% of the world population, which is actually much smaller than I thought. Creating Arabic, Hindi and Portuguese dictionaries would increase the population coverage by 12%, but there is a long way to go!

Measuring America’s historical threat perception


Last year, I wrote that the NYT API is a great source for historical analysis. Since then, I have been working on a project with my colleagues at the LSE to create a historical index of America's perceived threats. The project is coming to fruition, so I presented the latest results at the Waseda Data Week event, with additional discussion of the history of international news production.

If you are interested in how we are measuring threat perception, please see my presentation slides.


POLTEXT is coming to Tokyo


I am organizing the POLTEXT symposium in Tokyo on 14-15 September 2019. I participated in the conference in 2016 (Croatia) as a presenter and in 2018 (Hungary) as a tutorial instructor, and learnt a lot from other participants. This is my chance to offer such an opportunity to people from inside and outside of Japan and to contribute to the international development of quantitative text analysis. Please come and share your knowledge and experience with us. The call for papers is open until the end of March.

Computing document similarity in a large corpus


Since early this year, I have been asked by many people how to compute document (or feature) similarity in a large corpus. They said their function stops because of a lack of space in RAM:

Error in .local(x, y, ...) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92 

This happened in our textstat_simil(margin = "documents") too, because the matrix multiplication in the function produces a dense matrix with (ndoc(x) ^ 2) / 2 elements: the number of cells in the matrix is 5,000,000,000 if your corpus has 100,000 documents!
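
To see why this exceeds the RAM of most machines, a rough back-of-the-envelope calculation (my own illustration, not output from the function) of the memory needed to store those cells as 8-byte doubles:

# rough memory requirement of a dense similarity matrix for 100,000 documents
ndoc <- 100000
cells <- ndoc ^ 2 / 2   # 5,000,000,000 cells
cells * 8 / 1024 ^ 3    # about 37 GB of 8-byte doubles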

A solution to this problem is not recording values that are less than a certain threshold. For example, you might only be interested in documents with cosine similarity larger than 0.9 when you study document reuse. We upgraded our functions for document similarity computation, which are used in textstat_simil() and textstat_dist(), to achieve this in the latest GitHub version of quanteda. We also parallelized the computation in C++ to make it faster.

The new function is called textstat_proxy(). It is still experimental but has two new arguments, min_proxy and rank, to reduce the number of recorded values and save storage. If you set min_proxy, the function records only values larger than that threshold; if you use rank, it records only the top-n largest values for each document (or feature).

Benchmarking on my Core i7 laptop showed that the new function is twice as fast as the old one. If either min_proxy or rank is used, it becomes four times faster.

library(quanteda)
mt <- readRDS("data_corpus_guardian_2016-2017.RDS") %>% 
      dfm(remove_punct = TRUE, remove = stopwords()) %>% 
      dfm_trim(min_termfreq = 10)
dim(mt)
# [1] 84599 83573 # 84599 documents

# subset the corpus because it is too large for the old function
mt_sub <- dfm_sample(mt, 10000) 
dim(mt_sub)
# [1] 10000 83573 # 10000 documents

quanteda_options(threads = 8)
microbenchmark::microbenchmark(
    old = textstat_simil_old(mt_sub, method = "cosine"),  
    new = textstat_simil(mt_sub, method = "cosine"),  
    new_min = textstat_proxy(mt_sub, method = "cosine", min_proxy = 0.9),
    new_rank = textstat_proxy(mt_sub, method = "cosine", rank = 10),
    times = 10
)
# Unit: seconds
#      expr       min        lq      mean    median        uq       max neval
#       old 22.426574 22.641949 22.788590 22.745563 22.960467 23.160844    10
#       new 13.376352 13.417328 13.641411 13.638641 13.699010 14.226246    10
#   new_min  4.977046  5.010795  5.119516  5.114965  5.201249  5.314574    10
#  new_rank  5.303440  5.322976  5.411015  5.385124  5.482439  5.583506    10

More importantly, we can compute the document similarity between all 84,599 documents in the corpus without problems if min_proxy is used. It took only 15 minutes on my laptop, and the resulting object is as small as 12 MB.

new_min <- textstat_proxy(mt, method = "cosine", min_proxy = 0.9)
print(object.size(new_min), units = "Mb")
# 12.4 Mb

If you want to know which documents are similar to which, you can write a simple conversion function and run it:

# convert a sparse similarity matrix into a list of named vectors, one per document
matrix2list <- function(x) {
  names(x@x) <- rownames(x)[x@i + 1]
  split(x@x, factor(x@j + 1, levels = seq(ncol(x)), labels = colnames(x)))
}
simil <- matrix2list(new_min)
head(simil[lengths(simil) > 1])
# $text119559
# text119554 text119559 
#  0.9929825  1.0000000 
# 
# $text119561
# text119553 text119561 
#   0.994557   1.000000 
# 
# $text119562
# text119557 text119562 
#  0.9975438  1.0000000 
# 
# $text119564
# text119553 text119561 text119564 
#  0.9854428  0.9908825  1.0000000 
# 
# $text119568
# text119555 text119568 
#  0.9963637  1.0000000 
# 
# $text119570
# text119551 text119570 
#  0.9586148  1.0000000

textstat_proxy() has great potential, but it is still experimental because we are not sure what the best format for the resulting object is. If you have any opinion, please post a comment on the GitHub page.

Quantitative text analysis of Japanese


To get more Japanese researchers interested in quantitative text analysis, I wrote a paper titled 『日本語の量的分析』 (Quantitative Analysis of Japanese) together with Amy Catalinac of New York University. So far, it has received many positive reactions on Twitter.

This paper discusses the use, in Japanese, of a methodology called quantitative text analysis, which has become popular among political scientists in Europe and the United States in recent years. We first describe the background against which quantitative text analysis emerged and explain how it is used in European and American political science. Next, so that readers can use quantitative text analysis in their own research, we describe the workflow concretely, noting points that require attention when analyzing Japanese texts. Finally, we introduce statistical models used in Europe and the United States and demonstrate, with research examples, that they can also be applied to the analysis of Japanese documents. The paper argues that recent technical and methodological developments have made quantitative text analysis of Japanese fully feasible, but also notes that institutional issues, such as the availability of data, need to be addressed for the methodology to become widely adopted in Japanese political science.

Newsmap is available on CRAN


I am happy to announce that our R package for semi-supervised document classification, newsmap, is available on CRAN. The package is simple in terms of its algorithm but comes with well-maintained geographical seed dictionaries in English, German, Spanish, Russian and Japanese.

This package was created originally for geographical classification of news articles, but it can also be used for other tasks such as topic classification. For example, I performed two-dimensional (country-by-topic) classification in my latest paper on conspiracy theories using tentative topical seed words; a sketch of how such seed words can be used follows the list:

economy: [market*, money, bank*, stock*, bond*, industry, company, shop*]
politics: [parliament*, congress*, party leader*, party member*, voter*, lawmaker*, politician*]
society: [police, prison*, school*, hospital*]
diplomacy: [ambassador*, diplomat*, embassy, treaty]
military: [military, soldier*, air force, marine, navy, army]
nature: [water, wind, sand, forest, mountain, desert, animal, human]
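
As a rough sketch (not the exact code from the paper), the seed words above can be written as a quanteda dictionary and passed to the same newsmap workflow used for countries. Here, toks is assumed to be a tokens object of the news articles.

library(quanteda)
library(newsmap)

# the tentative topical seed words above as a quanteda dictionary
dict_topic <- dictionary(list(
  economy   = c("market*", "money", "bank*", "stock*", "bond*", "industry", "company", "shop*"),
  politics  = c("parliament*", "congress*", "party leader*", "party member*", "voter*",
                "lawmaker*", "politician*"),
  society   = c("police", "prison*", "school*", "hospital*"),
  diplomacy = c("ambassador*", "diplomat*", "embassy", "treaty"),
  military  = c("military", "soldier*", "air force", "marine", "navy", "army"),
  nature    = c("water", "wind", "sand", "forest", "mountain", "desert", "animal", "human")
))

# toks is assumed to be a tokens object of the news articles
label_dfm <- dfm(tokens_lookup(toks, dict_topic))
feat_dfm <- dfm(toks)
topic <- predict(textmodel_newsmap(feat_dfm, label_dfm))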

In the mosaic plot, the width and height of the columns show the proportions of countries and topics, respectively. Since the categories are pre-defined here, it is much easier to interpret the result than in unsupervised topic classification by LDA.

By the way, if you want to produce the plot, just pass a cross table of countries and topics to mosaicplot():

# data has one row per document with its predicted country and topic
top <- head(sort(table(data$country), decreasing = TRUE), 20)  # 20 most frequent countries
tb <- table(data$country, data$topic)[names(top), ]            # country-by-topic cross table
mosaicplot(tb, border = 0, col = RColorBrewer::brewer.pal(6, "Set2"), main = "")