Reputation: 165
I am working on a Chinese NLP project. I need to remove all punctuation characters except those characters between numbers and remain only Chinese character(\u4e00-\u9fff),alphanumeric characters(0-9a-zA-Z).For example,the hyphen in 12-34 should be kept while the equal mark after 123 should be removed.
Here is my python script.
import re
s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])','',s)
print(res)
the expected output should be
中国中国foo中国bar中123国中国12-34中国
but the result is
中国中国foo中国bar中123=国中国12-34中国
I can't figure out why there is an extra equal sign in the output?
Upvotes: 2
Views: 1544
Reputation: 10512
Your regex will first check "="
against [^\u4e00-\u9fff0-9a-zA-Z]+
. This will succeed. It will then check the lookbehind and lookahead, which must both fail. Ie: If one of them succeeds, the character is kept. This means your code actually keeps any non-alphanumeric, non-Chinese characters which have numbers on any side.
You can try the following regex:
u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'
You can use it as such:
import re
s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))',s)
print(res.join(''))
Upvotes: 1
Reputation: 626802
I suggest matching and capturing these characters in between digits (to restore them later in the output), and just match them in other contexts.
In Python 2, it will look like
import re
s = u"中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
pat_block = u'[^\u4e00-\u9fff0-9a-zA-Z]+';
pattern = u'([0-9]+{0}[0-9]+)|{0}'.format(pat_block)
res = re.sub(pattern, lambda x: x.group(1) if x.group(1) else u"" ,s)
print(res.encode("utf8")) # => 中国中国foo中国bar中123国中国12-34中国
See the Python demo
If you need to preserve those symbols inside any Unicode digits, you need to replace [0-9]
with \d
and pass the re.UNICODE
flag to the regex.
The regex will look like
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+)|[^\u4e00-\u9fff0-9a-zA-Z]+
It will works like this:
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+)
- Group 1 capturing
[0-9]+
- 1+ digits [^\u4e00-\u9fff0-9a-zA-Z]+
- 1+ chars other than those defined in the specified ranges[0-9]+
- 1+ digits |
- or[^\u4e00-\u9fff0-9a-zA-Z]+
- 1+ chars other than those defined in the specified rangesIn Python 2.x, when a group is not matched in re.sub
, the backreference to it is None, that is why a lambda expression is required to check if Group 1 matched first.
Upvotes: 1