Dreampuf

Reputation: 1181

How to filter Chinese (ONLY Chinese)

I want to convert text that includes punctuation and full-width symbols into pure Chinese text.

import re

maybe_re = re.compile("xxxxxxxxxxxxxxxxx")  # TODO: what pattern goes here?
print("".join(maybe_re.findall("你好,这只是一些中文文本..,.,全角")))

# Desired output:
# 你好这只是一些中文文本全角

Upvotes: 7

Views: 3147

Answers (3)

Andj

Reputation: 1437

An old question, but for future reference: the third-party regex module, unlike the standard re module, supports Unicode script properties in patterns.

For the question's purpose it is enough to match Han ideographs. \p{script=Han} matches any Han ideograph; \p{isHan}, \p{sc=Han} and \p{Han} are abbreviated forms of the same pattern.

import regex as re
s = "你好,这只是一些中文文本..,.,全角"
print("".join(re.findall(r'\p{Han}', s)))
# 你好这只是一些中文文本全角
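
If installing regex is not an option, a rough approximation with the standard re module is to spell out code point ranges by hand. This is a sketch, not an equivalent of \p{Han}: it assumes that the CJK Unified Ideographs block and Extension A are enough for your text, and it misses ideographs in the other extension blocks. The names HAN_RUN and only_han are made up for this example.

```python
import re

# Assumption: only these two blocks are needed.
#   U+3400-U+4DBF  CJK Unified Ideographs Extension A
#   U+4E00-U+9FFF  CJK Unified Ideographs
HAN_RUN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]+")


def only_han(text):
    """Return text with every character outside the two ranges removed."""
    return "".join(HAN_RUN.findall(text))


print(only_han("你好,这只是一些中文文本..,.,全角"))
# 你好这只是一些中文文本全角
```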

Upvotes: 0

Régis B.

Reputation: 10618

The Zhon library provides you with a list of Chinese punctuation marks: https://pypi.python.org/pypi/zhon

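import re
import zhon.unicode  # module name as used in this answer; newer Zhon releases may expose the list under a different name

# Remove every character that appears in Zhon's punctuation list.
text = re.sub('[%s]' % zhon.unicode.PUNCTUATION, "", "你好,这只是一些中文文本..,.,全角")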

This does almost what you want, but not exactly: the sentence you provide contains some very non-standard punctuation marks, such as ".", which this substitution leaves in place. Still, I think Zhon might be useful to others with a similar issue.
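
One way to also catch the leftover marks is to add ASCII punctuation to the character class. The sketch below uses only the standard library: CJK_PUNCT_SAMPLE is a small hand-picked stand-in for a real punctuation list such as Zhon's, not the library's actual contents, and strip_punctuation is a hypothetical helper name. Note that it only removes punctuation, so digits and Latin letters would survive.

```python
import re
import string

# Hypothetical stand-in for a full Chinese punctuation list:
# full-width comma, ideographic full stop, ideographic comma,
# full-width semicolon, colon, question mark, exclamation mark, full stop.
CJK_PUNCT_SAMPLE = "\uff0c\u3002\u3001\uff1b\uff1a\uff1f\uff01\uff0e"

# re.escape keeps characters such as ']' and '\' from breaking the class.
PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation + CJK_PUNCT_SAMPLE))


def strip_punctuation(text):
    """Remove ASCII punctuation and the sample full-width marks."""
    return PUNCT_RE.sub("", text)


print(strip_punctuation("你好,这只是一些中文文本..,.,全角"))
# 你好这只是一些中文文本全角
```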

Upvotes: 4

Thomas K

Reputation: 40370

I don't know of any good way to separate Chinese characters from other letters, but you can distinguish letters from other characters. Using regexes, you can use r"\w" (compiled with the re.UNICODE flag if you're on Python 2). That will include digits and the underscore as well as letters, but not punctuation.

unicodedata.category(c) will tell you what type of character c is. Your Chinese letters are "Lo" (Letter, other: letters without case), while the punctuation is "Po" (Punctuation, other).
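
A minimal sketch of the category approach, assuming Python 3. The function names are invented for illustration. Filtering on "Lo" alone also keeps other caseless letters (Japanese kana, for example), so the second function narrows the result by checking the Unicode character name, which for these ideographs starts with "CJK UNIFIED IDEOGRAPH".

```python
import unicodedata


def keep_other_letters(text):
    """Keep characters whose general category is 'Lo' (Letter, other)."""
    return "".join(c for c in text if unicodedata.category(c) == "Lo")


def keep_cjk_ideographs(text):
    """Keep characters whose Unicode name marks them as CJK unified ideographs."""
    return "".join(
        c for c in text
        if unicodedata.name(c, "").startswith("CJK UNIFIED IDEOGRAPH")
    )


sample = "你好,这只是一些中文文本..,.,全角"
print(keep_other_letters(sample))
# 你好这只是一些中文文本全角
print(keep_cjk_ideographs("abc你好かな1"))
# 你好
```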

Upvotes: 4
