Reputation: 80
Here is a text with English words, CJK characters, and fullwidth parentheses (\uff08 and \uff09):
这是(一段测试)文字(start开始end)的结果
I want to split the text into words; for CJK characters, each character is one word. The special requirement is that the fullwidth left parenthesis \uff08 should combine with the word after it, and the fullwidth right parenthesis \uff09 should combine with the word before it.
The expected result will be:
这
是
(一
段
测
试)
文
字
(start
开
始
end)
的
结
果
Currently, I use new Regex(@"(\s+)|([\u0000-\u001F\u0021-\u007F]+)|([^\u0000-\u007F])") to split the text, but the fullwidth parentheses are not combined with the word before or after them.
Upvotes: 1
Views: 66
Reputation: 147206
You can add these two special cases:
(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007f]+))
and
((?:[^\u0000-\u007F]|[\u0021-\u007f]+)\uff09)
to your regex, giving you a complete regex of:
(\s+)|(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007f]+))|((?:[^\u0000-\u007F]|[\u0021-\u007f]+)\uff09)|([\u0000-\u001F\u0021-\u007F]+)|([^\u0000-\u007F])
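Note that they must appear in the alternation before the parts that could match the word on its own. The regex engine tries alternatives from left to right at each position, so otherwise the plain-word alternative would match first and the parenthesis would be left as a separate token.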
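As a rough sketch of how the complete regex could be applied, the snippet below collects the matches with Regex.Matches instead of splitting. The class name CjkTokenizer and the method Tokenize are hypothetical names chosen for illustration, and dropping whitespace-only matches is an assumption about what the caller wants. The sample text is written with \uff08 and \uff09 escapes so the fullwidth parentheses are unambiguous.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original question.
public static class CjkTokenizer
{
    // The two parenthesis alternatives are listed before the plain-word
    // alternatives, because alternatives are tried left to right.
    private static readonly Regex Token = new Regex(
        @"(\s+)" +
        @"|(\uff08(?:[^\u0000-\u007F]|[\u0021-\u007F]+))" + // left paren + following word
        @"|((?:[^\u0000-\u007F]|[\u0021-\u007F]+)\uff09)" + // preceding word + right paren
        @"|([\u0000-\u001F\u0021-\u007F]+)" +               // run of ASCII non-space characters
        @"|([^\u0000-\u007F])");                            // single non-ASCII character

    public static string[] Tokenize(string text) =>
        Token.Matches(text)
             .Cast<Match>()
             .Select(m => m.Value)
             .Where(v => !string.IsNullOrWhiteSpace(v)) // assumption: whitespace tokens are not wanted
             .ToArray();

    public static void Main()
    {
        string text = "这是\uff08一段测试\uff09文字\uff08start开始end\uff09的结果";
        // Prints one token per line.
        Console.WriteLine(string.Join("\n", Tokenize(text)));
    }
}
```

For the sample text this should yield the 15 tokens listed in the question, with each fullwidth parenthesis attached to its neighbouring word. One case the regex does not merge is a single word wrapped on both sides, such as a lone CJK character between \uff08 and \uff09: the left-parenthesis alternative consumes the word first, so the right parenthesis ends up as its own token.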
Upvotes: 3