Reputation: 689

Split string containing Chinese or Japanese or English into words

How can I split a string containing Chinese, Japanese, or English text into words, using a regex or a utility class?

Example 1:

根據從2013年的一項研究，由一群來自美國俄亥俄州立大學的研

Output 1:

根據從 2013 年的一項研究，由一群來自美國俄亥俄州立大學的研

Example 2:

According to a 2013 study by a research group from the US to

Output 2:

According, to, a, 2013, study, by, a, research, group, from, the, US, to

The input string will never mix English with Japanese; each language will arrive in its own string. An English string should still be split by the same piece of code, which currently looks like this:

String[] words = input.split("[ ./()\\[\\]=,<>;\"']+");

If this is not possible in Java, please suggest whether the non-English input strings could instead be separated on whitespace characters only.
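For the whitespace-only fallback, one rough option is to insert a space wherever a Han character meets a digit and then split on whitespace. The sketch below assumes that this spacing (as in Output 1) is all that is wanted; it does not segment Chinese into real words. The class and method names (`CjkSpacer`, `spaceDigits`, `splitOnWhitespace`) are made up for illustration.

```java
import java.util.Arrays;

class CjkSpacer {
    // Zero-width boundary between a Han character and a decimal digit,
    // in either order. \p{IsHan} is Java's Unicode script syntax.
    private static final String HAN_DIGIT_BOUNDARY =
            "(?<=\\p{IsHan})(?=\\p{Nd})|(?<=\\p{Nd})(?=\\p{IsHan})";

    // Hypothetical helper: puts a space at every Han/digit boundary.
    static String spaceDigits(String input) {
        return input.replaceAll(HAN_DIGIT_BOUNDARY, " ");
    }

    // Hypothetical helper: spaces the digits, then splits on whitespace only.
    static String[] splitOnWhitespace(String input) {
        return spaceDigits(input).trim().split("\\s+");
    }

    public static void main(String[] args) {
        String example = "根據從2013年的一項研究";
        System.out.println(spaceDigits(example));
        System.out.println(Arrays.toString(splitOnWhitespace(example)));
    }
}
```

This only separates digit runs from the surrounding Han text; every run of Han characters stays in one piece, so it is not a substitute for dictionary-based segmentation.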

Upvotes: 1

Answers (2)

Nathan_Lee

Reputation: 11

Example 1:

根據從2013年的一項研究，由一群來自美國俄亥俄州立大學的研

Output 1:

根據從 2013 年的一項研究，由一群來自美國俄亥俄州立大學的研

This is incorrect Chinese. The correct output should be:

根據從 2013 年的一項研究，由一群來自美國俄亥俄州立大學的研

You need a library for Chinese words to do this.

Upvotes: 1

dlu

Reputation: 791

I think the problem that you may have with Chinese (and maybe Japanese as well, but I don't know as much about it) is that the word breaks are contextual. Sometimes two characters will be two separate words, sometimes the same two characters will be a single word.

So I think you will need to parse the text to be able to do this.
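As a starting point for the parsing approach, the JDK ships `java.text.BreakIterator`, which finds word boundaries by locale-aware rules rather than a fixed delimiter list. The sketch below is a minimal illustration, not a full solution: `WordSplitter` and `splitWords` are hypothetical names, and how well the JDK's built-in iterator segments Chinese or Japanese is an open assumption here, since contextual word breaks usually need a dictionary-backed segmenter.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordSplitter {
    // Hypothetical helper: walks the boundaries reported by BreakIterator and
    // keeps only segments that contain at least one letter or digit, which
    // drops whitespace and punctuation segments.
    static List<String> splitWords(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> words = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(splitWords(
                "According to a 2013 study by a research group from the US to",
                Locale.ENGLISH));
        // Results for Chinese depend on the BreakIterator implementation in use.
        System.out.println(splitWords("根據從2013年的一項研究", Locale.TRADITIONAL_CHINESE));
    }
}
```

For the English example this yields the same word list as the regex split in the question. For Chinese or Japanese, treat the output as something to inspect and compare against a dedicated segmentation library before relying on it.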

Upvotes: 4
