Sidal

Reputation: 129

How to split CJK text into words?

I'm using JavaScript to build a transliteration tool. Is it possible to split CJK text into a sequence of words, defined according to some word segmentation standard? If not, is there an alternative?

Desired Behavior:

Input: 动的密习近平

Result: [动, 的, 密, 习, 近平]

Upvotes: 2

Views: 1245

Answers (2)

izogfif

Reputation: 7495

As of 2023-07-12, Chromium- and WebKit-based browsers support the non-standard expand() method on the Range object. It accepts the string 'word' as an argument, but it only works on elements that are attached to the document. So you can try this:

(function () {
  const textToSplit = '动的密习近平'
  const words = []
  // Create a wrapper element
  const div = document.createElement('div')
  // Make it transparent so its contents can be selected but aren't visible
  // If you try to make it "display: none" or "visibility: hidden", 
  // it won't work.
  div.style.opacity = '0'
  // Place the element at the start of the page to avoid messing up with
  // page layout.
  div.style.position = 'fixed'
  div.style.left = '0'
  div.style.top = '0'
  // Give browser a hint about content language inside wrapper element
  // Use 'zh-Hans' for Simplified Chinese and 'zh-Hant' for Traditional Chinese
  div.lang = 'zh-Hans'
  // Add element to the body, if you don't add it, it won't work
  document.body.appendChild(div)
  // Create a text node with your text
  const textNode = document.createTextNode(textToSplit)
  // Add it to the wrapper element 
  div.appendChild(textNode)
  // Create selection range
  const range = document.createRange()
  // Calculate maximum offset
  range.selectNodeContents(textNode)
  const maxOffset = range.endOffset
  // In a loop, move range after the end of the last known word
  // and call "expand('word')" method to make the range span the entire word
  for (let lastKnownWordEnd = 0;
       lastKnownWordEnd < maxOffset;
       lastKnownWordEnd = range.endOffset
  ) {
    range.setStart(textNode, lastKnownWordEnd)
    range.setEnd(textNode, lastKnownWordEnd)
    range.expand('word')
    words.push(range.toString())
  }
  // Do something with results
  console.log(words)
  // Clean up:
  div.remove()
})()
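
Since expand() is non-standard, it's worth feature-detecting it before relying on this approach; Firefox, for example, does not implement it. A minimal check (the fallback is up to you):

// Range.expand is non-standard; fall back to a library-based
// segmenter in browsers that don't implement it (e.g. Firefox).
if (typeof Range.prototype.expand !== 'function') {
  console.warn("Range.expand('word') is not supported in this browser")
  // ...use one of the segmenter libraries mentioned below instead
}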

It's not always accurate: it treats "不知道" as a single word instead of splitting it into "不" and "知道".

If you need higher accuracy, you should use language-specific word segmenter libraries like the ones mentioned in answers to this StackOverflow question.

Upvotes: 0

Ahmed Fasih

Reputation: 6927

To do this properly, people use machine learning, because the challenge, as you know, is that these languages (Chinese and Japanese at least) are written without spaces. There are some great tools that do this, in a few different programming languages:

  • Rakuten MA handles both Chinese and Japanese, is written in JavaScript, and might be the best option for you (see the sketch after this list).
  • MeCab is the granddaddy of Japanese parsers, in C++.
  • (KyTea is also in C++ and also for Japanese, but I haven't used it.)
  • Kuromoji is yet another one for Japanese, in Java.
  • There are probably others I'm not aware of (sorry, I don't know anything about Korean parsers 😭, but doesn't Korean use spaces? If so, segmentation may be much easier there).
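
If you try Rakuten MA in Node, a minimal sketch looks like the following; the file names and the hash-function setup are taken from the project's README, so double-check them against the current repo:

// Minimal Rakuten MA sketch for Chinese (Node). Assumes you've cloned
// the repo so that rakutenma.js and the pre-trained model_zh.json are
// available locally; use model_ja.json / default_featset_ja for Japanese.
const fs = require('fs')
const RakutenMA = require('./rakutenma')

const model = JSON.parse(fs.readFileSync('model_zh.json'))
const rma = new RakutenMA(model)
rma.featset = RakutenMA.default_featset_zh
// 15-bit feature hashing, as recommended in the README
rma.hash_func = RakutenMA.create_hash_func(15)

// tokenize() returns [token, part-of-speech] pairs
const tokens = rma.tokenize('动的密习近平')
console.log(tokens.map(([word]) => word))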

Obviously, to use the non-JavaScript tools in the browser, you'd need to run them on the backend (as Kuromoji does to power its demo page). And even though you can run Rakuten MA in the browser, note that the browser will need to download a pretty large data file up front that the algorithm uses to parse text: see their demo page.
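
The browser side of such a backend setup is just a small request; here's a sketch against a hypothetical /segment endpoint (the endpoint name and response shape are assumptions, not any particular tool's API):

// POST the raw text to a hypothetical segmentation service and get
// back an array of words. Adjust the URL and shape to your backend.
async function segmentOnServer(text) {
  const response = await fetch('/segment', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text })
  })
  const { words } = await response.json() // assumed shape: { words: [...] }
  return words
}

segmentOnServer('动的密习近平').then(words => console.log(words))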

Another option might be to compile the C++ tools to JavaScript through Emscripten. I did this with MeCab (repo, demo page that also downloads a big data file up-front).

Note that all these tools do more than just parse text into words: it turns out they need to do full morphological analysis and part-of-speech tagging in order to segment accurately. So if you want "just" to split a sentence into words, be prepared to wade through a lot of output you might not care about. But I just saw that your goal is transliteration, so maybe you are interested in that? MeCab and Kuromoji can tell you their guesses for words' pronunciations. Rakuten MA will only segment and tell you part of speech; it doesn't do transliteration (you'll have to look up the words in a dictionary, etc.).
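
Building on the Rakuten MA sketch above, that transliteration step would be a per-token dictionary lookup; the dictionary below is a hypothetical stand-in for a real resource like CC-CEDICT:

// tokens come from rma.tokenize() as [token, posTag] pairs.
// pinyinDict is a hypothetical placeholder; a real app would load
// a full pronunciation dictionary such as CC-CEDICT.
const pinyinDict = { '近平': 'jìnpíng', '动': 'dòng' }

function transliterate(tokens) {
  return tokens
    .map(([word]) => pinyinDict[word] ?? word) // fall back to the raw token
    .join(' ')
}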

There are also lighter-weight approaches than these, e.g., Japanese learners are familiar with the Rikaichan Firefox extension (and Rikaikun and Rikaisama for other browsers), which I believe does low-complexity parsing using just a dictionary and some rules. Rikaichan's source might be helpful to study. But if you need respectable, accurate results, this won't beat one of the above parsers.
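
To make the "dictionary and some rules" idea concrete, here's a sketch of greedy longest-match segmentation, the core of such low-complexity approaches (the tiny dictionary is illustrative only):

// Naive longest-match segmentation: at each position, take the longest
// dictionary word that matches, falling back to a single character.
// Real tools layer deinflection rules and a full dictionary on top.
function longestMatchSegment(text, dictionary, maxWordLength = 4) {
  const words = []
  let i = 0
  while (i < text.length) {
    let match = text[i] // fallback: a single character
    for (let len = Math.min(maxWordLength, text.length - i); len > 1; len--) {
      const candidate = text.slice(i, i + len)
      if (dictionary.has(candidate)) {
        match = candidate
        break
      }
    }
    words.push(match)
    i += match.length
  }
  return words
}

// With the question's example and a one-word dictionary:
console.log(longestMatchSegment('动的密习近平', new Set(['近平'])))
// → ['动', '的', '密', '习', '近平']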

Upvotes: 1
