Tim

Reputation: 99488

Unicode regex to match a character class of Chinese characters

^[一二三四五六七]、 doesn't match 一、

But ^一、 matches 一、.

Is my way of specifying a character class of Chinese characters wrong?

I read the regular expression from a file.

Upvotes: 0

Views: 1749

Answers (2)

Raniz

Reputation: 11113

You need to make sure that you read the files using the correct encoding:

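import re

# Decode both files explicitly as UTF-8 instead of relying on the platform default.
with open('my-regex-file', encoding='utf-8') as f:
    # strip() removes a trailing newline, which would otherwise become part of the pattern
    regex = re.compile(f.read().strip())
with open('my-text-file', encoding='utf-8') as f:
    text = f.read()
if regex.match(text):
    print("It's a match!")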
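
Besides the encoding, it may be worth checking for a trailing newline in the regex file, since most editors add one when saving. The following is a minimal sketch, assuming the file holds only the pattern on one line; the temporary directory and the variable names are hypothetical and exist only to make the example self-contained.

```python
import os
import re
import tempfile

# Hypothetical setup: write the pattern to a temporary file the way an
# editor typically would, with a newline at the end.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'my-regex-file')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('^[一二三四五六七]、\n')

    # Read it back with an explicit encoding.
    with open(path, encoding='utf-8') as f:
        raw_pattern = f.read()

# With the newline left in, the pattern also demands a newline after '、'.
unstripped = re.match(raw_pattern, '一、')

# After strip(), only the intended pattern remains.
stripped = re.match(raw_pattern.strip(), '一、')

print(unstripped)
print(stripped.group(0))
```

In this sketch the unstripped pattern fails against the bare string because the string has no newline after 、, while the stripped pattern matches.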

Upvotes: 1

Avinash Raj

Reputation: 174776

Works for me:

>>> import re
>>> re.match(u'^[一二三四五六七]、', u'一、')
<_sre.SRE_Match object; span=(0, 2), match='一、'>
>>> re.match(u'^[一二三四五六七]、', u'一、').group(0)
'一、'

I think your regex was not defined as a Unicode string, so the character class was built from raw bytes rather than from whole characters.
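
To illustrate why that would produce exactly the symptom in the question, here is a minimal sketch in Python 3. It assumes the pattern and text ended up as undecoded UTF-8 bytes (which is what a plain, non-`u` string literal holds in Python 2); the variable names are only for illustration. In a bytes pattern, `[...]` is a set of single bytes, so it consumes only the first byte of 一 and the following 、 can no longer line up, whereas the literal pattern `^一、` is just a byte sequence and still matches.

```python
import re

pattern = '^[一二三四五六七]、'
text = '一、'

# Text (str) pattern: the class contains seven characters, so it matches.
str_match = re.match(pattern, text)

# Bytes pattern: the class contains the individual UTF-8 bytes of those
# characters, so it matches one byte of '一' and then fails on '、'.
bytes_class_match = re.match(pattern.encode('utf-8'), text.encode('utf-8'))

# A literal bytes pattern has no class, so the byte sequences simply line up.
bytes_literal_match = re.match('^一、'.encode('utf-8'), text.encode('utf-8'))

print(str_match is not None)
print(bytes_class_match is not None)
print(bytes_literal_match is not None)
```

Decoding both the pattern and the text to Unicode strings before matching avoids the problem.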

In Python 3, it would be:

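# -*- coding: utf-8 -*-

import re

# Decode the pattern file explicitly; the default encoding depends on the locale.
with open('file', encoding='utf-8') as f:
    reg = f.read().strip()  # drop the trailing newline from the pattern
    print(re.match(reg, u'一、').group(0))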

Upvotes: 3
