Reputation: 129
I am using JavaScript to create a transliteration, and I am wondering whether it is possible to split CJK text into a sequence of words, defined according to some word segmentation standard. If that isn't possible, is there any alternative approach?
Desired Behavior:
Input: 动的密习近平
Result: [动, 的, 密, 习, 近平]
Upvotes: 2
Views: 1245
Reputation: 7495
As of 2023-07-12, Chromium- and WebKit-based browsers support the non-standard expand method of the Range object. This method accepts the text string 'word' as an argument, but it only works on elements that are attached to the document.
So you can try this:
(function () {
const textToSplit = '动的密习近平'
const words = []
// Create a wrapper element
const div = document.createElement('div')
// Make it transparent so its contents can be selected but aren't visible
// If you try to make it "display: none" or "visibility: hidden",
// it won't work.
div.style.opacity = 0
// Place the element at the top of the page so it doesn't affect
// the page layout
div.style.position = 'fixed'
div.style.left = 0
div.style.top = 0
// Give the browser a hint about the content language inside the wrapper element
// Use 'zh-Hans' for Simplified Chinese and 'zh-Hant' for Traditional Chinese
div.lang = 'zh-Hans'
// Add the element to the body; if it isn't attached, expand() won't work
document.body.appendChild(div)
// Create a text node with your text
const textNode = document.createTextNode(textToSplit)
// Add it to the wrapper element
div.appendChild(textNode)
// Create selection range
const range = document.createRange()
// Calculate maximum offset
range.selectNodeContents(textNode)
const maxOffset = range.endOffset
// In a loop, move the range to just after the end of the last known word
// and call expand('word') to make the range span the entire next word
for (let lastKnownWordEnd = 0;
lastKnownWordEnd < maxOffset;
lastKnownWordEnd = range.endOffset
) {
range.setStart(textNode, lastKnownWordEnd)
range.setEnd(textNode, lastKnownWordEnd)
range.expand('word')
words.push(range.toString())
}
// Do something with results
console.log(words)
// Clean up:
div.remove()
})()
It's not always accurate: for example, it treats "不知道" as a single word instead of splitting it into "不" and "知道".
If you need higher accuracy, you should use language-specific word segmenter libraries like the ones mentioned in answers to this StackOverflow question.
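A related built-in option: where it is available (Chromium- and WebKit-based browsers at the time of writing, not Firefox), the standard Intl.Segmenter API can split text at word granularity without touching the DOM. A minimal sketch, reusing the same Simplified Chinese locale hint:
// Minimal sketch using the standard Intl.Segmenter API
// Feature-detect first, since not every browser ships it
if (typeof Intl !== 'undefined' && Intl.Segmenter) {
  const segmenter = new Intl.Segmenter('zh-Hans', { granularity: 'word' })
  const words = []
  for (const { segment, isWordLike } of segmenter.segment('动的密习近平')) {
    // isWordLike is false for whitespace and punctuation
    if (isWordLike) words.push(segment)
  }
  console.log(words)
}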
Upvotes: 0
Reputation: 6927
To do this properly, people use machine learning, because, as you know, the challenge is that these languages (Chinese and Japanese at least) are written without spaces. There are some great tools that do this in a few different programming languages; Kuromoji, Rakuten MA, and MeCab are the ones discussed below.
Obviously, to use the non-JavaScript tools in the browser, you'd need to run them on the backend (as Kuromoji does to power its demo page). And while you can run Rakuten MA directly in the browser, note that the browser will need to download a pretty large data file up front that the algorithm uses to parse text: see their demo page.
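For a rough idea of what calling Rakuten MA looks like, here is a sketch based on its README; the model file name and the Chinese feature-set constant are the ones the project documents, so double-check them against the repo:
// Sketch based on the Rakuten MA README; file paths and names are assumptions
const fs = require('fs')
const RakutenMA = require('./rakutenma')
// Load a pre-trained Chinese model (a fairly large JSON file)
const model = JSON.parse(fs.readFileSync('model_zh.json'))
const rma = new RakutenMA(model)
rma.featset = RakutenMA.default_featset_zh
rma.hash_func = RakutenMA.create_hash_func(15)
// tokenize() returns [token, part-of-speech tag] pairs
console.log(rma.tokenize('动的密习近平'))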
Another option might be to compile the C++ tools to JavaScript through Emscripten. I did this with MeCab (repo, demo page that also downloads a big data file up-front).
Note that all these tools do more than just parse text into words: it turns out they need to do actual morphological analysis and part-of-speech tagging in order to segment accurately. So if you want "just" to split a sentence into words, be prepared to wade through a lot of output you might not care about. But I just saw that your goal is transliteration, so maybe you are interested in that? MeCab/Kuromoji can tell you their guesses for words' pronunciations. Rakuten MA will only segment and tell you parts of speech; it doesn't do transliteration (you'll have to look up the words in a dictionary, etc.).
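If a Node backend is an option, the kuromoji npm package (a JavaScript port of Kuromoji, Japanese only) illustrates the pronunciation point: each token carries a reading alongside the surface form. A rough sketch, assuming the dictionary path from the package README:
// Rough sketch with the kuromoji npm package; dicPath is an assumption
const kuromoji = require('kuromoji')
kuromoji.builder({ dicPath: 'node_modules/kuromoji/dict' }).build((err, tokenizer) => {
  if (err) throw err
  const tokens = tokenizer.tokenize('すもももももももものうち')
  // Each token has surface_form, pos, basic_form, reading (katakana), etc.
  console.log(tokens.map(t => [t.surface_form, t.reading]))
})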
There are also lighter-weight approaches than these, e.g., Japanese learners are familiar with the Rikaichan Firefox extension (and Rikaikun and Rikaisama for other browsers), which I believe does low-complexity parsing using just a dictionary and some rules. Rikaichan's source might be helpful to study. But if you need respectable, accurate results, this won't beat one of the above parsers.
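To give a concrete sense of that dictionary-plus-rules style, here is a toy sketch of greedy longest-match segmentation; the dictionary is a hypothetical stand-in, and a real one would have many thousands of entries plus extra rules for inflection:
// Toy sketch of greedy longest-match segmentation; the dictionary is a stand-in
function segmentLongestMatch (text, dictionary, maxWordLength = 4) {
  const words = []
  let i = 0
  while (i < text.length) {
    // Fall back to a single character if no dictionary entry matches
    let match = text[i]
    for (let len = Math.min(maxWordLength, text.length - i); len > 1; len--) {
      const candidate = text.slice(i, i + len)
      if (dictionary.has(candidate)) {
        match = candidate
        break
      }
    }
    words.push(match)
    i += match.length
  }
  return words
}
const toyDictionary = new Set(['近平', '知道'])
console.log(segmentLongestMatch('动的密习近平', toyDictionary))
// ['动', '的', '密', '习', '近平']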
Upvotes: 1