A dictionary for quantitative text analysis in Japanese

Kohei | March 3, 2019 | December 22, 2019

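Keyword dictionaries are widely used in quantitative text analysis, but for Japanese there are almost none suited to social scientific analysis, and I think this has been an obstacle to both research and teaching. Recently, however, an acquaintance told me about the Nikkei Thesaurus, in which about 15,000 words are grouped into the following 23 fields.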

[1] "一般・共通"              "経済・産業"               "経営・企業"
[4] "農林水産"                "食品"                    "繊維・木材・紙パ"
[7] "資源・エネルギー"         "金属・土石"               "化学"
[10] "機械・器具・設備"        "電子電機"                 "情報・通信"
[13] "建設"                  "流通・サービス・家庭用品"  "環境・公害"
[16] "科学技術・文化"          "自然界"                  "国際"
[19] "政治"                  "地方"                    "労働・教育・医療"
[22] "社会・家庭"             "地域"

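It looks usable at least for the analysis of newspaper articles, so I collected the words and compiled them in YAML format. The single-word version keeps the entries exactly as they appear on the website, while in the multiword version the entries are segmented with quanteda's tokens(), which makes it easier to use for dictionary analysis and for compounding multiword expressions.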
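As a rough sketch of what the segmentation step could look like (this is an assumption for illustration, not the script actually used to build the files), the snippet below segments a few made-up entries with tokens(), rejoins the pieces with spaces, and prints the result as YAML. The entries, the helper `segment()`, and the object names are all hypothetical; how each word is split depends on the tokenizer, so no particular segmentation is implied.

```r
library(quanteda)

# Hypothetical entries for illustration only; the real thesaurus has
# about 15,000 words in 23 fields.
words <- list(
  "政治" = c("内閣総理大臣", "選挙"),
  "経済・産業" = c("金融政策", "景気")
)

# Segment each entry with tokens() and rejoin the pieces with spaces,
# so that an entry split into several tokens becomes a
# space-separated phrase.
segment <- function(x) {
  pieces <- as.list(tokens(x))
  unname(vapply(pieces, paste, character(1), collapse = " "))
}
words_multi <- lapply(words, segment)

# Show the entries in YAML form: field names as keys, entries as items.
cat(yaml::as.yaml(words_multi))

# The same list can also be turned into a quanteda dictionary directly.
dict_multi <- dictionary(words_multi)
```

Because dictionary values separated by spaces are treated as sequences of tokens, entries stored this way can be matched against texts that were segmented by the same tokenizer.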

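The simplest way to use this thesaurus is with quanteda:

library(quanteda)
# toks is a tokens object created beforehand with tokens()
dict <- dictionary(file = "nikkei-thesaurus_multiword.yml")
tokens_lookup(toks, dict)    # dictionary analysis
tokens_compound(toks, dict)  # compounding of multiword expressions

For details on how to use dictionaries, see the Quanteda Tutorials. If you download articles from the Asahi Shimbun's Kikuzo (聞蔵) or the Yomiuri Shimbun's Yomidas (ヨミダス) databases, the newspapers package makes it easy to read the texts into R.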
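To make the two calls concrete, here is a small self-contained sketch that does not need the YAML file. The miniature dictionary and the pre-segmented sentence are made up for illustration, and the tokens are supplied by hand through as.tokens() so that the example does not depend on how a tokenizer would split the text.

```r
library(quanteda)

# Hypothetical miniature dictionary standing in for the real file;
# multiword entries are written as space-separated tokens.
dict <- dictionary(list(
  "政治" = c("内閣 総理 大臣", "選挙"),
  "経済・産業" = c("金融 政策", "景気")
))

# A hand-segmented sentence (assumed segmentation, for illustration).
toks <- as.tokens(list(
  doc1 = c("内閣", "総理", "大臣", "が", "金融", "政策", "を", "説明", "した")
))

# Dictionary analysis: matched words and phrases are replaced by the
# names of the fields they belong to.
toks_dict <- tokens_lookup(toks, dict)

# Compounding: matched multiword expressions are joined into single
# tokens; an empty concatenator avoids inserting "_" between the parts.
toks_comp <- tokens_compound(toks, dict, concatenator = "")

# Counting field matches per document gives a document-feature matrix.
dfmat <- dfm(toks_dict)

print(toks_dict)
print(toks_comp)
```

tokens_lookup() is the one to use when only the field counts matter, while tokens_compound() keeps the rest of the text and is useful as a preprocessing step before other analyses.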
