Reputation: 99488
^[一二三四五六七]、 doesn't match 一、, but ^一、 matches 一、.
Is my way of specifying a character class of Chinese characters wrong?
I read the regular expression from a file.
Upvotes: 0
Views: 1749
Reputation: 11113
You need to make sure that you read the files using the correct encoding:
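import re

# Read the pattern; strip() drops a trailing newline that would otherwise become part of the regex
with open('my-regex-file', encoding='utf-8') as f:
    regex = re.compile(f.read().strip())
# Read the text to search with the same encoding
with open('my-text-file', encoding='utf-8') as f:
    text = f.read()
if regex.match(text):
    print("It's a match!")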
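To illustrate why the encoding matters, here is a minimal sketch that writes the pattern to a temporary file as UTF-8 and reads it back twice: once as UTF-8 and once as Latin-1. The temporary file and the choice of Latin-1 as the "wrong" encoding are assumptions made purely for the demonstration; the asker's actual setup may differ.

```python
import os
import re
import tempfile

pattern_text = '^[一二三四五六七]、'

# Write the pattern to a throwaway file as UTF-8 (hypothetical stand-in for the regex file)
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt', delete=False) as f:
    f.write(pattern_text + '\n')
    path = f.name

try:
    # Correct: decode the file as UTF-8
    with open(path, encoding='utf-8') as f:
        good = f.read().rstrip('\n')
    # Wrong on purpose: every UTF-8 byte becomes its own Latin-1 character
    with open(path, encoding='latin-1') as f:
        bad = f.read().rstrip('\n')
finally:
    os.remove(path)

good_match = re.match(good, '一、')
bad_match = re.match(bad, '一、')

print(good == pattern_text)       # pattern survived the round trip
print(good_match is not None)     # correctly decoded pattern matches
print(bad_match is not None)      # mis-decoded pattern does not match
```

With the wrong encoding, the character class ends up containing the individual mis-decoded bytes instead of the seven Chinese numerals, so it can no longer match 一.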
Upvotes: 1
Reputation: 174776
Works for me,
>>> import re
>>> re.match(u'^[一二三四五六七]、', u'一、')
<_sre.SRE_Match object; span=(0, 2), match='一、'>
>>> re.match(u'^[一二三四五六七]、', u'一、').group(0)
'一、'
I think you failed to define your regex as a Unicode string.
In Python 3, it would be:
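# -*- coding: utf-8 -*-
import re

# Read the pattern as text; an explicit encoding avoids relying on the locale default
with open('file', encoding='utf-8') as f:
    reg = f.read().strip()
print(re.match(reg, u'一、').group(0))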
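The symptom in the question (the literal ^一、 matches but the character class does not) can be reproduced by matching on UTF-8 bytes instead of text. This is only a sketch of one plausible cause, assuming the pattern and the subject ended up as byte strings: a character class in a bytes pattern matches exactly one byte, while 一 occupies three bytes in UTF-8.

```python
import re

class_pattern = '^[一二三四五六七]、'
literal_pattern = '^一、'
text = '一、'

# Text (str) patterns: the class matches one whole character
str_class_match = re.match(class_pattern, text)

# Bytes patterns: the class matches a single byte, so the remaining
# two bytes of 一 get in the way of the following 、
bytes_class_match = re.match(class_pattern.encode('utf-8'), text.encode('utf-8'))

# A literal byte sequence still matches, byte for byte
bytes_literal_match = re.match(literal_pattern.encode('utf-8'), text.encode('utf-8'))

print(str_class_match is not None)      # str class pattern matches
print(bytes_class_match is not None)    # bytes class pattern fails
print(bytes_literal_match is not None)  # bytes literal pattern matches
```

Keeping both the pattern and the subject as str, as in the answer's code, avoids the problem.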
Upvotes: 3