Text analysis – Page 4 – Kohei Watanabe

Text analysis | November 2, 2020

Study political and economic changes with semisupervised text analysis methods

Earlier this year, I published my first paper on semisupervised methods (Newsmap and seeded LDA) in Social Science Computer Review. My second paper, on another semisupervised method (Latent Semantic Scaling), appeared in Communication Methods and Measures a few days ago. I wrote these research articles and developed software packages as part of my effort […]

Programming, Text analysis | September 14, 2020 (updated September 15, 2020)

Uploaded two new semisupervised models to CRAN

This summer, I submitted two packages for quantitative text analysis to CRAN: seededlda and LSX. These packages have been available in my GitHub repositories, but I thought it was time to make them more readily available to promote semisupervised machine learning techniques. seededlda is a package that implements seeded-LDA using the GibbsLDA++ library. […]

Text analysis | June 27, 2020

Quanteda and semisupervised models

My co-developers and I received the 2020 Statistical Software Award from the Society for Political Methodology for quanteda's contribution to research. The package has established a reputation as a user-friendly and highly efficient R package for quantitative text analysis in the political science community. I also know that there are many users of the package in other […]

Text analysis | April 8, 2020 (updated November 21, 2020)

New paper on Latent Semantic Scaling

I developed Latent Semantic Scaling (LSS) to perform sentiment analysis of news articles about the Ukraine crisis in my PhD project in London. LSS only requires a small set of polarity words, called “seed words”, to perform large-scale document scaling on a specific subject, because it automatically identifies synonyms of seed words by latent semantic […]

Text analysis | March 25, 2020

New stopwords collection for European and Asian languages

In quantitative text analysis, it is common to remove grammatical elements using the stopword lists defined in Snowball, but Snowball does not contain stopwords for Asian languages. The lack of a stopword collection that covers both European and Asian languages has made cross-lingual analysis difficult. To solve this problem, my collaborators and I created a new stopword collection, called […]

Japanese, Text analysis | March 25, 2020

Japanese stopwords

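In quantitative text analysis, it is common to remove function words, which are grammatical elements, during preprocessing, and the lists defined in Snowball are widely used for European languages such as English. However, because Snowball does not cover Asian languages such as Japanese, there was no suitable list for comparative analysis across European and Asian languages. To solve this problem, we extended and translated Snowball's English list to create Marimo, a new stopword collection that can be applied to both European and Asian languages. At present, the collection contains only English, German, Japanese, Arabic and Hebrew, but we plan to add more languages.

A distinctive feature of Marimo is that its stopwords are organized hierarchically by type. Identifying the role of each word makes translation easier and makes it easy to exclude words that are not needed. For example, categories such as reporting, time and number were added for the analysis of newspaper articles, but they are probably unnecessary for other types of documents. Japanese documents contain a large number of tokens that consist of a single hiragana character, but because these can easily be removed with a regular expression, they are left out of the list, which keeps it easy to maintain.

Marimo's YAML files can easily be loaded into R with dictionary() in the quanteda package. We also plan to make these lists available through the stopwords package.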
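The idea of grouping stopwords by type, and of handling single-hiragana tokens with a pattern instead of a list, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the Marimo data or the quanteda API: the category names reporting, time and number come from the post, while the `functional` category, every word entry, and the helper names `select_stopwords` and `remove_stopwords` are hypothetical. The Japanese tokens are assumed to be already segmented.

```python
import re

# Hypothetical miniature of a Marimo-style list: stopwords grouped by type.
# "reporting", "time" and "number" are category names mentioned in the post;
# "functional" and all word entries are made up for illustration.
MARIMO_LIKE = {
    "functional": {"the", "of", "and", "to"},
    "reporting": {"said", "told", "according"},
    "time": {"today", "yesterday", "monday"},
    "number": {"one", "two", "million"},
}

# Matches one hiragana character (U+3041 to U+3096); used with fullmatch()
# so that only tokens consisting of exactly one such character are caught.
SINGLE_HIRAGANA = re.compile(r"[\u3041-\u3096]")


def select_stopwords(collection, categories):
    """Merge the chosen categories into one flat set of stopwords."""
    selected = set()
    for name in categories:
        selected |= collection[name]
    return selected


def remove_stopwords(tokens, stopwords):
    """Drop listed stopwords (case-insensitively) and single-hiragana tokens."""
    return [
        tok for tok in tokens
        if tok.lower() not in stopwords and not SINGLE_HIRAGANA.fullmatch(tok)
    ]


news = "The minister said today that two of the reforms failed".split()

# Newspaper text: apply every category.
print(remove_stopwords(news, select_stopwords(MARIMO_LIKE, MARIMO_LIKE)))

# Other document types: leave out the news-specific categories.
print(remove_stopwords(news, select_stopwords(MARIMO_LIKE, ["functional"])))

# Pre-segmented Japanese: single-hiragana particles are caught by the pattern,
# so they never need to appear in the list itself.
japanese = ["私", "は", "東京", "に", "行き", "ました"]
print(remove_stopwords(japanese, set()))
```

Because the categories are kept separate until selection time, dropping the news-specific groups for another corpus is a one-argument change and does not require editing the list.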

Publication, Text analysis | January 29, 2020 (updated February 21, 2020)

New research paper on how to choose seed words for semi-supervised models

In recent years, I have been developing and applying semi-supervised models, such as seeded-LDA, Newsmap and LSS, for classification and document scaling, with the aim of broadening the scope of quantitative text analysis. These models are very cost-efficient because they only require a small set of “seed words” to learn categories or dimensions of interest. However, […]

Programming, Text analysis | December 25, 2019 (updated January 19, 2020)

Why is quanteda so fast?

Those who read my recent post on quanteda’s performance might wonder why the package is so fast. This is not only because we carefully wrote the R code for the package but also because we optimized its internal functions and objects for large textual data. There are three design features of quanteda that dramatically enhance its performance. Upfront data […]

Programming, Text analysis | December 19, 2019 (updated January 19, 2020)

R and Python text analysis packages performance comparison – updated

I compared the performance of R and Python in 2017, when we were developing quanteda v1.0, and confirmed that our package’s execution time is around 50% shorter and its peak memory consumption 40% smaller than gensim’s. Two years later, we are developing quanteda v2.0, which will be released early next year. We are improving the […]
