Redefining word boundaries by collocation analysis

Standard

Quanteda’s tokenizer can segment Japanese and Chinese texts thanks to stringi, but its results are not always good, because its underlying function, ICU, recognizes only limited number of words. For example, this Japanese text

"ニューヨークのケネディ国際空港"

can be translated to “Kennedy International Airport (ケネディ国際空港) in (の) New York (ニューヨーク)”. Quanteda’s tokenizer (tokens function) segments this into too small pieces:

"ニュー"       "ヨーク"       "の"           "ケネディ"     "国際"         "空港"

Apparently, the first two words should not be separated. The standard Japanese POS tagger, Mecab, does just this:

"ニューヨーク" "の"           "ケネディ"     "国際"         "空港"

However, the erroneous segmentation can be corrected by running quaneda’s sequences function on a large corpus of news to identify contiguous collocations. After the correction of the word boundaries both the first (ニューヨーク) and last (国際空港) parts are joined together.

"ニューヨーク" "の"             "ケネディ"     "国際空港"

This is exactly the same approach to phrases and multi-word names in English texts. The process of word boundary correction is a series of collocation analysis and token concatenation. The data used to discover collocation comprises 138,108 news articles.

load('data_corpus_asahi_q10.RData')
toks <- tokens(corpus_segment(data_corpus_asahi_q10, what = "other", delimiter = "。"), include_docvars = TRUE)

toks <- tokens_select(toks, '^[0-9ぁ-んァ-ヶー一-龠]+$', valuetype = 'regex', padding = TRUE)

min_count <- 50

# process class of words that include 国際 and 空港
seqs_kanji <- sequences(toks, '^[一-龠]+$', valuetype = 'regex', nested = FALSE, 
                        min_count = min_count, ordered = FALSE) 
toks <- tokens_compound(toks, seqs_kanji[seqs_kanji$p < 0.01,], valuetype = 'fixed', 
                        concatenator = '', join = TRUE)

# process class of words that include ニュー and ヨーク
seqs_kana <- sequences(toks, '^[ァ-ヶー]+$', valuetype = 'regex', nested = FALSE, 
                       min_count = min_count, ordered = FALSE) 
toks <- tokens_compound(toks, seqs_kana[seqs_kana$p < 0.01,], valuetype = 'fixed', 
                        concatenator = '', join = TRUE)

# process both classes of words
seqs <- sequences(toks, '^[0-9ァ-ヶー一-龠]+$', valuetype = 'regex', nested = FALSE, 
                  min_count = min_count, ordered = FALSE)
toks <- tokens_compound(toks, seqs[seqs$p < 0.01,], valuetype = 'fixed', 
                        concatenator = '', join = TRUE)

saveRDS(toks, 'data_tokens_asahi.RDS')

Leave a Reply

Your email address will not be published. Required fields are marked *