Reputation: 689
How can I split a string containing Chinese, Japanese, or English text into words, using a regex or a utility class?
Example 1:
根據從2013年的一項研究,由一群來自美國俄亥俄州立大學的研
Output 1:
根 據 從 2013 年的 一 項研究,由 一 群來 自 美 國 俄 亥 俄 州 立 大 學 的 研
Example 2:
According to a 2013 study by a research group from the US to
Output 2:
According, to, a, 2013, study, by, a, research, group, from, the, US, to
The input string is guaranteed not to mix English with Japanese; each language will come in its own string. An English string should still be split by this piece of code:
String[] words = input.split("[ ./()\\[\\]=,<>;\"']+");
If this is not possible in Java, please advise whether the non-English input strings could be split on whitespace characters only.
Upvotes: 1
Views: 1866
Reputation: 11
Example 1:
根據從2013年的一項研究,由一群來自美國俄亥俄州立大學的研
Output 1:
根 據 從 2013 年的 一 項研究,由 一 群來 自 美 國 俄 亥 俄 州 立 大 學 的 研
This is not a correct segmentation of the Chinese text. The correct output should be:
根據 從 2013 年 的 一項 研究,由 一群 來自 美國 俄亥俄州 立 大學 的 研
You need a Chinese word-segmentation library to do this.
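To give a rough idea of what such a library does internally, here is a minimal sketch of greedy forward maximum matching against a word list. It is only an illustration: the class name MaxMatchSegmenter, the tiny dictionary, and the maxWordLength parameter are all made up for this example, and real segmenters use far larger dictionaries plus statistical models. It assumes Java 9+ (for Set.of) and text made of BMP characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not a real library API.
class MaxMatchSegmenter {

    // At each position, take the longest dictionary entry that matches;
    // if nothing matches, fall back to a single character.
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            // Shrink the candidate until it is in the dictionary or one char long.
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Toy dictionary, assumed for the example only.
        Set<String> dictionary = Set.of("來自", "美國", "大學", "研究");
        System.out.println(segment("來自美國的大學", dictionary, 4));
        // prints [來自, 美國, 的, 大學]
    }
}
```

Because the match is greedy, it will pick the wrong split whenever a longer dictionary word happens to span a real word boundary, which is one reason a proper library is the better choice.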
Upvotes: 1
Reputation: 791
I think the problem you may have with Chinese (and maybe Japanese as well, though I know less about it) is that word breaks are contextual: sometimes two characters are two separate words, and sometimes the same two characters form a single word.
So I think you will need to parse the text to be able to do this.
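As a possible starting point that stays inside the JDK, java.text.BreakIterator finds word boundaries without a hand-written regex. The sketch below (the class name WordBoundaryDemo and the helper words are invented for this example) is only verified here for English input. How the JDK's iterator groups runs of Chinese characters depends on the implementation, so I would not assume it matches a dictionary-based segmentation. ICU4J's com.ibm.icu.text.BreakIterator exposes a very similar API and, as far as I know, uses a dictionary for Chinese and Japanese.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for illustration.
class WordBoundaryDemo {

    // Collects the segments between word boundaries, skipping segments
    // that contain no letter or digit (spaces, punctuation).
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("According to a 2013 study, by a research group", Locale.ENGLISH));
        // prints [According, to, a, 2013, study, by, a, research, group]
    }
}
```

For the Chinese example you would pass something like Locale.TRADITIONAL_CHINESE and check the result against text you know how to segment before relying on it.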
Upvotes: 4