Fixing tokens containing 「っ」 with regular expressions

Kohei · October 28, 2019 · January 19, 2020

quanteda's ICU-based Japanese tokenization works well for the most part, but it seems to struggle with sentences that contain a small 「っ」, as in 「持った」, 「言った」, and 「踊った」.

> library(quanteda)
> txt <- "持ってくると言った覚えはない"
> toks <- tokens(txt)
> print(toks)
tokens from 1 document.
text1 :
 [1] "持"   "って" "くる" "と"   "言"   "っ"   "た"   "覚え" "は"   "ない"

According to the MeCab morphological analysis below, the tokens should be 「持っ」 and 「言っ」, but ICU produces 「って」 and 「っ」, which are meaningless as tokens.

持っ	モッ	モツ	持つ	動詞-一般	五段-タ行	連用形-促音便
て	テ	テ	て	助詞-接続助詞		
くる	クル	クル	来る	動詞-非自立可能	カ行変格	終止形-一般
と	ト	ト	と	助詞-格助詞		
言っ	イッ	イウ	言う	動詞-一般	五段-ワア行	連用形-促音便
た	タ	タ	た	助動詞	助動詞-タ	連体形-一般
覚え	オボエ	オボエ	覚え	名詞-普通名詞-一般		
は	ワ	ハ	は	助詞-係助詞		
ない	ナイ	ナイ	無い	形容詞-非自立可能	形容詞	終止形-一般

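That led me to the idea of fixing the tokens with tokens_compound() and tokens_split(). The former has been around for a long time; the latter is a relatively new function that does the opposite. With this approach, tokens_split() first turns every 「っ」 into a standalone token, and tokens_compound() then joins it to the single-kanji token that precedes it. As a result, I obtained tokens identical to MeCab's segmentation.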

> toks <- tokens_split(toks, "っ", valuetype = "fixed", remove_separator = FALSE)
> print(toks)
tokens from 1 document.
text1 :
 [1] "持"   "っ"   "て"   "くる" "と"   "言"   "っ"   "た"   "覚え" "は"   "ない"

> toks <- tokens_compound(toks, list(c("^[一-龠]$", "^っ$")), valuetype = "regex", concatenator = "")
> print(toks)
tokens from 1 document.
text1 :
[1] "持っ" "て"   "くる" "と"   "言っ" "た"   "覚え" "は"   "ない"
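
The two steps can be wrapped in a small helper so that they are always applied together. This is only a sketch: the name fix_sokuon() is hypothetical and not part of quanteda, and the input is built with as.tokens() from the ICU output shown earlier, so the example does not depend on the tokenizer version.

```r
library(quanteda)

# Hypothetical helper (not part of quanteda): split off "っ", then
# re-attach it to a preceding single-kanji token.
fix_sokuon <- function(x) {
  # Step 1: make every "っ" a standalone token, keeping the separator itself
  x <- tokens_split(x, "っ", valuetype = "fixed", remove_separator = FALSE)
  # Step 2: join a single-kanji token and a following "っ" with no concatenator
  tokens_compound(x, list(c("^[一-龠]$", "^っ$")),
                  valuetype = "regex", concatenator = "")
}

# Tokens copied by hand from the ICU output above
toks_icu <- as.tokens(list(text1 = c("持", "って", "くる", "と", "言",
                                     "っ", "た", "覚え", "は", "ない")))
toks_fixed <- fix_sokuon(toks_icu)
print(as.list(toks_fixed))
```

Keeping both calls in one function makes it harder to run the split without the compound step, which would leave stray 「っ」 tokens behind.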

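The risk with this approach is that tokens_split() may break up unrelated tokens that merely contain 「っ」, but this should not be a problem for most documents. Both functions are also parallelized in C++, so they should be fast.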
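
The risk can be illustrated with a small sketch. The tokens below are made up by hand for illustration (they are an assumption, not verified ICU output): 「ちょっと」 contains 「っ」 but has no preceding kanji token, whereas 「待」 followed by 「って」 is the case the fix is meant for.

```r
library(quanteda)

# Hand-made tokens (assumed for illustration, not actual tokenizer output)
toks_demo <- as.tokens(list(text1 = c("ちょっと", "待", "って")))

# Same two steps as in the post
toks_demo <- tokens_split(toks_demo, "っ", valuetype = "fixed",
                          remove_separator = FALSE)
toks_demo <- tokens_compound(toks_demo, list(c("^[一-龠]$", "^っ$")),
                             valuetype = "regex", concatenator = "")

# Inspect which tokens were repaired and which were broken apart
result <- as.list(toks_demo)$text1
print(result)
```

Given the behavior shown earlier for 「って」, 「ちょっと」 is expected to remain split into fragments, because 「ちょ」 is not a single kanji and is never joined back, while 「待」 and 「っ」 are compounded into 「待っ」.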
