Redefining word boundaries by collocation analysis

Quanteda’s tokenizer can segment Japanese and Chinese texts thanks to stringi, but the results are not always good, because the underlying ICU library recognizes only a limited number of words. For example, the Japanese text “ニューヨークのケネディ国際空港” translates to “Kennedy International Airport (ケネディ国際空港) in (の) New York (ニューヨーク)”. Quanteda’s tokenizer (the tokens function) segments this into […]
