Kishore_2021

Reputation: 689

Split string containing Chinese or Japanese or English into words

How can I split a string containing Chinese, Japanese, or English text into words, using a regex or a utility class?

Example 1:

根據從2013年的一項研究,由一群來自美國俄亥俄州立大學的研

Output 1:

根 據 從 2013 年的 一 項研究,由 一 群來 自 美 國 俄 亥 俄 州 立 大 學 的 研

Example 2:

According to a 2013 study by a research group from the US to

Output 2:

According, to, a, 2013, study, by, a, research, group, from, the, US, to

The input will never mix English with Japanese; each language arrives in a separate string. An English string should still be split by this piece of code:

words = input.split("[ ./()\\[\\]=,<>;\"']+");

If this is not possible in Java, please advise whether the non-English input strings could instead be split on whitespace characters only.

Upvotes: 1

Views: 1866

Answers (2)

Nathan_Lee

Reputation: 11

Example 1:

根據從2013年的一項研究,由一群來自美國俄亥俄州立大學的研

Output 1:

根 據 從 2013 年的 一 項研究,由 一 群來 自 美 國 俄 亥 俄 州 立 大 學 的 研

This segmentation is not correct Chinese. The correct output should be:

根據 從 2013 年 的 一項 研究,由 一群 來自 美國 俄亥俄州 立 大學 的 研

You need a Chinese word-segmentation library to do this.
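To illustrate why a word list is needed, here is a minimal sketch of forward maximum matching, one simple dictionary-based approach. It is not what any particular library does internally. The class name `MaxMatchSegmenter`, the `maxWordLength` parameter, and the tiny hand-picked dictionary are all assumptions made for this example; a real segmenter would load a large dictionary and handle ambiguity far better.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of any library.
class MaxMatchSegmenter {

    // Forward maximum matching: at each position take the longest dictionary
    // word that starts there. Digit runs are kept together; anything else
    // (including punctuation) falls back to a single-character token.
    // Assumes all characters are in the Basic Multilingual Plane.
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            int end = i + 1; // fallback: one character
            if (Character.isDigit(c)) {
                while (end < text.length() && Character.isDigit(text.charAt(end))) {
                    end++;
                }
            } else {
                int longest = Math.min(maxWordLength, text.length() - i);
                for (int len = longest; len > 1; len--) {
                    if (dictionary.contains(text.substring(i, i + len))) {
                        end = i + len;
                        break;
                    }
                }
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Toy dictionary chosen by hand for this one sentence (an assumption).
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "根據", "一項", "研究", "一群", "來自", "美國", "俄亥俄州", "大學"));
        String input = "根據從2013年的一項研究,由一群來自美國俄亥俄州立大學的研";
        System.out.println(String.join(" ", segment(input, dictionary, 4)));
        // prints: 根據 從 2013 年 的 一項 研究 , 由 一群 來自 美國 俄亥俄州 立 大學 的 研
    }
}
```

With this toy dictionary the result matches the segmentation above, except that the comma becomes its own token. Greedy matching picks the wrong word whenever a longer dictionary entry overlaps the intended break, which is one reason dedicated libraries exist.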

Upvotes: 1

dlu

Reputation: 791

I think the problem that you may have with Chinese (and maybe Japanese as well, but I don't know as much about it) is that the word breaks are contextual. Sometimes two characters will be two separate words, sometimes the same two characters will be a single word.

So I think you will need to parse the text to be able to do this.
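As a starting point, the JDK does ship a locale-aware boundary analyzer, `java.text.BreakIterator`. The sketch below is an assumption about how one might use it here, not a confirmed solution: the class name `BreakIteratorWords` and the letter-or-digit filter are made up for this example, and the JDK's rules are not guaranteed to produce linguistically correct Chinese or Japanese words. ICU4J provides a `BreakIterator` with dictionary-based CJK segmentation that may do better.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only.
class BreakIteratorWords {

    // Walks the word boundaries reported by the JDK for the given locale and
    // keeps every piece that contains at least one letter or digit, which
    // drops pieces made only of whitespace or punctuation.
    static List<String> words(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> words = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(piece);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words(
                "According to a 2013 study by a research group", Locale.ENGLISH));
        // The Chinese result depends on the JDK's boundary rules, so no
        // particular output is claimed here.
        System.out.println(words(
                "根據從2013年的一項研究,由一群來自美國俄亥俄州立大學的研",
                Locale.TRADITIONAL_CHINESE));
    }
}
```

For English this gives the same words as splitting on whitespace and punctuation. For Chinese, compare the printed pieces against a known-good segmentation before relying on them.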

Upvotes: 4
