Segmentation of Japanese or Chinese texts by stringi

The use of a POS tagger such as MeCab or ChaSen is usually considered necessary for segmenting Japanese texts, because words are not separated by spaces as they are in European languages, but I recently learned that this is not always the case. When I was testing quanteda's tokenization function, I passed a Japanese text to it without much expectation, but the result was very interesting. The Japanese words were segmented very nicely! The output was as accurate as that from MeCab.

> txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"
> quanteda::tokens(txt_jp)
tokens from 1 document.
Component 1 :
 [1] "政治"         "と"           "は"           "社会"         "に対して"     "全体"         "的"           "な"          
 [9] "影響"         "を"           "及"           "ぼ"           "し"           "、"           "社会"         "で"          
[17] "生きる"       "ひとりひとり" "の"           "人"           "の"           "人生"         "に"           "も"          
[25] "様々"         "な"           "影響"         "を"           "及ぼす"       "複雑"         "な"           "領域"        
[33] "で"           "ある"         "。"         

quanteda's tokenizer also segmented Chinese texts very well, but it should not be able to do this, because there is no POS tagger in the package.

> txt_cn <- "政治是各种团體进行集体决策的一个过程,也是各种团體或个人为了各自的領域所结成的特定关系,尤指對於某一政治實體的統治,例如統治一個國家,亦指對於一國內外事務之監督與管制。"
> quanteda::tokens(txt_cn)
tokens from 1 document.
Component 1 :
 [1] "政治"   "是"     "各种"   "团"     "體"     "进行"   "集体"   "决策"   "的"     "一个"   "过程"   ","     "也是"  
[14] "各种"   "团"     "體"     "或"     "个人"   "为了"   "各自"   "的"     "領域"   "所"     "结成"   "的"     "特定"  
[27] "关系"   ","     "尤"     "指"     "對於"   "某一"   "政治"   "實體"   "的"     "統治"   ","     "例如"   "統治"  
[40] "一個"   "國家"   ","     "亦"     "指"     "對於"   "一"     "國內外" "事務"   "之"     "監督"   "與"     "管制"  
[53] "。"  

The answer to this mystery was found in stringi::stri_split_boundaries, which is the underlying function of quanteda's tokenizer. stri_split_boundaries uses a library called ICU (International Components for Unicode), and that library uses dictionaries to segment texts in Chinese, Japanese, Thai or Khmer. The Japanese dictionary is actually an IPA dictionary, which MeCab also depends on.
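
To see the dictionary-based segmentation without quanteda, the same text can be passed to stringi directly. This is only a minimal sketch: it assumes a UTF-8 session and that quanteda's tokenizer asks for word boundaries (type = "word"), and the exact tokens may vary with the ICU version that stringi was built against.

```r
library(stringi)

txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"

# Split at ICU word boundaries. The function returns a list with one
# element per input string, so take the first element.
tokens_jp <- stri_split_boundaries(txt_jp, type = "word")[[1]]
print(tokens_jp)

# The boundaries only cut the string, so joining the pieces
# gives back the original text.
rejoined <- stri_join(tokens_jp, collapse = "")
print(rejoined == txt_jp)
```

Punctuation marks such as "、" and "。" come back as tokens of their own here, just as in the quanteda output above.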

This means that those who perform bag-of-words text analysis of Chinese, Japanese, Thai or Khmer texts no longer need to install a POS tagger for word segmentation. This would be a massive boost for social scientific text analysis in those languages!
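
As a rough illustration of what a bag-of-words count could look like with stringi alone, the sketch below segments the Chinese example and tabulates the tokens. The skip_word_none option is assumed here to be an appropriate way to drop punctuation and whitespace, and the resulting counts depend on how the installed ICU version segments the text.

```r
library(stringi)

txt_cn <- "政治是各种团體进行集体决策的一个过程,也是各种团體或个人为了各自的領域所结成的特定关系。"

# Segment at word boundaries and skip "words" that are only
# punctuation or whitespace.
tokens_cn <- stri_split_boundaries(txt_cn, type = "word",
                                   skip_word_none = TRUE)[[1]]

# Count how often each token occurs and list the most frequent first.
counts <- sort(table(tokens_cn), decreasing = TRUE)
print(head(counts))
```

In practice quanteda's tokens() and dfm() do the same job for a whole corpus, so this is only meant to show that no external tagger is involved.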
