ken wang

Reputation: 165

Python regular expression: how to remove all punctuation characters from a string but keep those between numbers?

I am working on a Chinese NLP project. I need to remove all punctuation characters except those that sit between numbers, keeping only Chinese characters (\u4e00-\u9fff) and alphanumeric characters (0-9a-zA-Z). For example, the hyphen in 12-34 should be kept, while the equals sign after 123 should be removed.

Here is my Python script.

import re
s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])','',s)
print(res)

The expected output is

中国中国foo中国bar中123国中国12-34中国

but the result is

中国中国foo中国bar中123=国中国12-34中国

I can't figure out why there is an extra equals sign in the output.

Upvotes: 2

Views: 1544

Answers (2)

Pedro Castilho

Reputation: 10512

Your regex matches a run of characters against [^\u4e00-\u9fff0-9a-zA-Z]+ only when both lookarounds succeed, that is, when the run has a non-digit immediately before it and a non-digit immediately after it. If either lookaround fails, the run is not matched and is therefore kept. The "=" comes right after "3", so the lookbehind (?<=[^0-9]) fails and the "=" survives. In other words, your code keeps any non-alphanumeric, non-Chinese characters that have a digit on either side.
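To see which runs the pattern from the question actually removes, here is a small check. The helper name removed_runs and the short test strings are made up for illustration; only the pattern comes from the question.

```python
import re

# Pattern from the question: punctuation with a non-digit on BOTH sides.
PUNCT = '[^\u4e00-\u9fff0-9a-zA-Z]+'
ORIGINAL = '(?<=[^0-9])' + PUNCT + '(?=[^0-9])'

def removed_runs(text):
    """Return the punctuation runs that re.sub would delete (hypothetical helper)."""
    return re.findall(ORIGINAL, text)

print(removed_runs('中%国'))      # the '%' is matched, so it would be removed
print(removed_runs('中123=国'))   # nothing matched: '=' follows a digit
print(removed_runs('中12-34国'))  # nothing matched: '-' is between digits
print(removed_runs('%中'))        # nothing matched: no character before '%'
```

The last case shows a second side effect: a lookbehind that requires a character can never succeed at the very start of the string, so leading punctuation is kept as well.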

You can try the following regex:

u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'

You can use it as such:

import re
s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))',s)
print(''.join(res))  # join is a str method: call it on the separator, not on the list
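As a variation on the same lookaround idea, the removal can also be written with re.sub. This is only a sketch, not part of the answer above: the function name strip_punct_lookaround is hypothetical, and the pattern captures punctuation that has a digit on both sides so the callback can put it back.

```python
import re

PUNCT = '[^\u4e00-\u9fff0-9a-zA-Z]+'
# Alternative 1: punctuation between two digits (captured, so it can be restored).
# Alternative 2: any other punctuation run (not captured, so it is dropped).
PATTERN = re.compile('(?<=[0-9])(' + PUNCT + ')(?=[0-9])|' + PUNCT)

def strip_punct_lookaround(text):
    # group(1) is None when alternative 2 matched, so fall back to ''.
    return PATTERN.sub(lambda m: m.group(1) or '', text)

s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
print(strip_punct_lookaround(s))  # => 中国中国foo中国bar中123国中国12-34中国
```

Because lookarounds do not consume the digits, a chain such as 1-2-3 keeps both hyphens.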

Upvotes: 1

Wiktor Stribiżew

Reputation: 626802

I suggest matching and capturing these characters when they appear between digits (to restore them later in the output), and just matching them in other contexts.

In Python 2, it will look like

import re
s = u"中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
pat_block = u'[^\u4e00-\u9fff0-9a-zA-Z]+'
pattern = u'([0-9]+{0}[0-9]+)|{0}'.format(pat_block)
res = re.sub(pattern, lambda x: x.group(1) if x.group(1) else u"", s)
print(res.encode("utf8")) # => 中国中国foo中国bar中123国中国12-34中国

See the Python demo

If you need to preserve those symbols between any Unicode digits, replace every [0-9] with \d (including the 0-9 range inside the negated character class) and pass the re.UNICODE flag to the regex.
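A minimal sketch of the Unicode-digit variant, written for Python 3, where \d already matches Unicode decimal digits in str patterns, so re.UNICODE is not needed there. The function name and the sample string with fullwidth digits are assumptions made for illustration.

```python
import re

# \d replaces 0-9 everywhere, including inside the negated class,
# so Unicode digits are never treated as punctuation.
BLOCK = r'[^\u4e00-\u9fff\da-zA-Z]+'
PATTERN = re.compile(r'(\d+{0}\d+)|{0}'.format(BLOCK))

def strip_punct_unicode(text):
    # Keep Group 1 (digits, separator, digits); drop every other match.
    return PATTERN.sub(lambda m: m.group(1) or '', text)

# Fullwidth digits and a fullwidth equals sign (illustrative input).
print(strip_punct_unicode('中国１２-３４中国５＝国'))  # => 中国１２-３４中国５国
```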

The regex will look like

([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+)|[^\u4e00-\u9fff0-9a-zA-Z]+

It works like this:

  • ([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+) - Group 1 capturing
    • [0-9]+ - 1+ digits
    • [^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
    • [0-9]+ - 1+ digits
  • | - or
  • [^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges

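In Python 2.x, when a group does not participate in the match, its value in re.sub is None, which cannot be returned as the replacement. That is why a lambda expression is required: it checks whether Group 1 matched first and falls back to an empty string otherwise.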
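For reference, here is how the same approach might look in Python 3, where string literals are already Unicode and no encode call is needed for printing. The function name strip_punct is hypothetical; the pattern is the one described above.

```python
import re

PAT_BLOCK = '[^\u4e00-\u9fff0-9a-zA-Z]+'
PATTERN = re.compile('([0-9]+{0}[0-9]+)|{0}'.format(PAT_BLOCK))

def strip_punct(text):
    # Group 1 holds "digits, punctuation, digits" and is restored as is;
    # when only the second alternative matched, group(1) is None.
    return PATTERN.sub(lambda m: m.group(1) or '', text)

s = "中国,中,。》%国foo中¥国bar@中123=国%中国12-34中国"
print(strip_punct(s))  # => 中国中国foo中国bar中123国中国12-34中国
```

One caveat of this pattern: Group 1 consumes the digits on both sides, so in a chain such as 1-2-3 only the first hyphen is kept and the result is 1-23.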

Upvotes: 1
