dashtaisen

Reputation: 75

Python regex unexpectedly replacing Chinese characters

I have a list of Chinese dictionary entries (based on CC-CEDICT) containing a mix of Chinese and Latin characters, one entry per line, in the following format:

(source.txt)

traditional_chars simplified_chars, pinyin, definition

山牆 山墙,shan1 qiang2,gable

B型超聲 B型超声, B xing2 chao1 sheng1,type-B ultrasound

I'd like to put a comma between the traditional and simplified characters:

(Desired result)

山牆,山墙,shan1 qiang2,gable

B型超聲,B型超声, B xing2 chao1 sheng1,type-B ultrasound

After some experimenting in regex101, I came up with this pattern:

[A-z]*[\u4300-\u9fff]+(\s)[A-z]*[\u4300-\u9fff]+,

I tried to apply this pattern in Python with the following code:

import re
sourcepath = 'sourcefile.txt'
destpath = 'result.txt'
pattern = '[A-z]*[\u4300-\u9fff]+(\s)[A-z]*[\u4300-\u9fff]+,'

source = open(sourcepath, 'r').read()
dest = open(destpath, 'w')
result = re.sub(pattern, ',', source)
dest.write(result)
dest.close()

But when I open result.txt, the result I get is not what I expected:

,shan1 qiang2,gable

, B xing2 chao1 sheng1,type-B ultrasound

I also tried using the regex module with this pattern:

[A-z]*\p{Han}(\s)[A-z]*\p{Han}

But the result was the same.

I thought that by putting the \s character in parentheses, that it would make a capture group, and only that space would be replaced. But it looks like the Chinese characters are getting replaced too. Did I make a mistake in the regular expression, the code, or both? How should I change it to get the desired result?

Upvotes: 2

Views: 1269

Answers (3)

Wiktor Stribiżew

Reputation: 627082

If you have an odd number of Chinese "words", your pattern should account for overlapping matches. Use a lookahead:

re.sub(r'(?i)[A-Z]*[\u4300-\u9fff]+(?=\s+[A-Z]*[\u4300-\u9fff]+)', r'\g<0>,', source)
                                   ^^^                         ^

Or emulate an atomic group: capture inside a positive lookahead, consume the captured text with a backreference, and add a negative lookahead that checks whether a comma is already there:

re.sub(r'(?i)[A-Z]*(?=([\u4300-\u9fff]+))\1(?!,)', r'\g<0>,', source) 
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 

See the regex demo (and demo 2). Do not pay attention to the \x{} notation; it is only there because the demo uses the PHP option.

See the IDEONE Python 3 demo:

import re
p = re.compile(r'[A-Z]*(?=([\u4300-\u9fff]+))\1(?!,)', re.IGNORECASE | re.U)
test_str = "山牆 山墙,shan1 qiang2,gable\nB型超聲 B型超声, B xing2 chao1 sheng1,type-B ultrasound"
result = p.sub(r"\g<0>,", test_str)
print(result)
# => 山牆, 山墙,shan1 qiang2,gable
# => B型超聲, B型超声, B xing2 chao1 sheng1,type-B ultrasound
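
Both substitutions above keep the original space after the inserted comma. If the space itself should be swapped for the comma, as in the desired result, one possible variation is to match only the whitespace and check its neighbours with lookarounds. This is a sketch under that assumption, not part of the demos above, and the name `SPACE_BETWEEN_HAN` is purely illustrative:

```python
import re

# Hypothetical helper pattern: whitespace that directly follows a Han character
# and is followed by optional Latin letters plus at least one Han character.
SPACE_BETWEEN_HAN = re.compile(
    r'(?<=[\u4300-\u9fff])\s+(?=[A-Za-z]*[\u4300-\u9fff])'
)

test_str = (
    "山牆 山墙,shan1 qiang2,gable\n"
    "B型超聲 B型超声, B xing2 chao1 sheng1,type-B ultrasound"
)

# Only the whitespace is consumed, so only it is replaced by the comma.
result = SPACE_BETWEEN_HAN.sub(",", test_str)
print(result)
# => 山牆,山墙,shan1 qiang2,gable
# => B型超聲,B型超声, B xing2 chao1 sheng1,type-B ultrasound
```

Note that this would also replace a space between two Chinese words inside a definition, so it assumes the definitions contain no Chinese text.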

Upvotes: 1

Pedro Lobito

Reputation: 98961

Tested on Python 3.5 with your sample code:

result = re.sub(r"([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)", r"\1,\2", source, flags=re.IGNORECASE)

Regex Explanation

([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)

Options: Case insensitive; Regex syntax only

Match the regex below and capture its match into backreference number 1 «([\u4e00-\u9fff]+)»
   Match a single character in the range between these two characters «[\u4e00-\u9fff]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      The character “一” which occupies Unicode code point U+4E00 «\u4e00»
      The Unicode character with code point U+9FFF «\u9fff»
Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line) «\s+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regular expression below «(?:[a-z]+)?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
   Match a single character in the range between “a” and “z” (case insensitive) «[a-z]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regex below and capture its match into backreference number 2 «([\u4e00-\u9fff]+)»
   Match a single character in the range between these two characters «[\u4e00-\u9fff]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      The character “一” which occupies Unicode code point U+4E00 «\u4e00»
      The Unicode character with code point U+9FFF «\u9fff»

\1,\2

Insert the text that was last matched by capturing group number 1 «\1»
Insert the character string “,” literally «,»
Insert the text that was last matched by capturing group number 2 «\2»
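
A quick check on the two sample lines from the question, as a sketch. Because the optional Latin prefix `(?:[a-z]+)?` sits outside group 2, a prefix such as the second `B` is matched but not put back by `\1,\2`. The variant below, which moves the prefix inside the group, is an assumption about the intent rather than part of the tested pattern:

```python
import re

source = (
    "山牆 山墙,shan1 qiang2,gable\n"
    "B型超聲 B型超声, B xing2 chao1 sheng1,type-B ultrasound"
)

# Pattern as given above: the Latin prefix before the second word is not captured.
original = re.sub(
    r"([\u4e00-\u9fff]+)\s+(?:[a-z]+)?([\u4e00-\u9fff]+)",
    r"\1,\2", source, flags=re.IGNORECASE,
)

# Assumed variant: keep the optional Latin prefix inside group 2.
variant = re.sub(
    r"([\u4e00-\u9fff]+)\s+([a-z]*[\u4e00-\u9fff]+)",
    r"\1,\2", source, flags=re.IGNORECASE,
)

print(original.splitlines()[1])
# => B型超聲,型超声, B xing2 chao1 sheng1,type-B ultrasound
print(variant.splitlines()[1])
# => B型超聲,B型超声, B xing2 chao1 sheng1,type-B ultrasound
```

The first sample line has no Latin prefix, so both patterns give the same result for it.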

Upvotes: 0

Daniel Martin

Reputation: 23548

I thought that by putting the \s character in parentheses, that it would make a capture group, and only that space would be replaced.

That's not how capturing groups work. Everything matched still gets replaced, but with a capturing group, you can refer to pieces of what got matched in the replacement.
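
To make that concrete, here is a small sketch using the pattern and the first sample line from the question. `re.search` shows that group 0 (the whole match) covers both words and the trailing comma, while group 1 is only the space; `re.sub` replaces group 0:

```python
import re

line = "山牆 山墙,shan1 qiang2,gable"
pattern = r'[A-z]*[\u4300-\u9fff]+(\s)[A-z]*[\u4300-\u9fff]+,'

m = re.search(pattern, line)
print(repr(m.group(0)))  # whole match: '山牆 山墙,'
print(repr(m.group(1)))  # captured group: ' '

# re.sub swaps out the whole match (group 0), not just the captured space.
print(re.sub(pattern, ',', line))
# => ,shan1 qiang2,gable
```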

I'd change two lines of your script:

pattern = r'(?i)([a-z]*[\u4300-\u9fff]+)\s([a-z]*[\u4300-\u9fff]+)'

And

result = re.sub(pattern, r'\g<1>,\g<2>', source)
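
Put together, the change might look like the sketch below. It assumes the goal is to keep both words and put a comma between them, so the replacement refers to groups 1 and 2. The function names `insert_comma` and `convert_file` and the UTF-8 encoding are illustrative choices, not part of your script:

```python
import re

# Two capture groups: the traditional word and the simplified word,
# each with an optional Latin prefix (e.g. the "B" in "B型超聲").
PATTERN = re.compile(r'(?i)([a-z]*[\u4300-\u9fff]+)\s([a-z]*[\u4300-\u9fff]+)')


def insert_comma(text):
    """Replace the space between the two Chinese words with a comma."""
    return PATTERN.sub(r'\g<1>,\g<2>', text)


def convert_file(sourcepath, destpath):
    """Hypothetical wrapper: read sourcepath, write the converted text to destpath."""
    with open(sourcepath, 'r', encoding='utf-8') as src:
        text = src.read()
    with open(destpath, 'w', encoding='utf-8') as dest:
        dest.write(insert_comma(text))


sample = (
    "山牆 山墙,shan1 qiang2,gable\n"
    "B型超聲 B型超声, B xing2 chao1 sheng1,type-B ultrasound"
)
print(insert_comma(sample))
# => 山牆,山墙,shan1 qiang2,gable
# => B型超聲,B型超声, B xing2 chao1 sheng1,type-B ultrasound
```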

Upvotes: 0
