Python - regex with Japanese letters matches only one character

Question

I'm trying to find certain words in Japanese addresses so I can scrub them. A regex for a single character works fine, but my pattern doesn't seem to find strings of two or more characters:

import re
add = u"埼玉県川口市金山町１２丁目１－１０４番地"

test = re.search(ur'["番地"|"丁目"]',add)
print test.group(0)

丁

I can use re.findall instead of re.search, but it puts all of its findings into a tuple, so then I have to parse the tuple. If that's the best way to do it I can live with it but I figure I'm missing something.

In the example above, I want to swap "丁目" with a dash and remove the trailing "番地", so that the address reads thusly:

埼玉県川口市金山町１２－１－１０４

falsetru · Accepted Answer

You're using | inside a character class ([...]). A character class matches any single character listed in it, and both | and " are treated as literal characters there, which is not what you want.

Specify the pattern without the character class (and without the " characters):

>>> import re
>>> add = u"埼玉県川口市金山町１２丁目１－１０４番地"
>>> test = re.search(ur'番地|丁目', add)
>>> test.group(0)
u'\u4e01\u76ee'
>>> print test.group(0)
丁目
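
To see why the original pattern returned only one character, it may help to compare both patterns side by side with re.findall. This is a minimal sketch written in Python 3 syntax (an assumption; the snippets above are Python 2, where the u/ur prefixes are needed). The names char_class and alternation are just illustrative.

```python
import re

# Python 3 syntax (assumption): str is already Unicode, so no u'' prefix is needed.
add = "埼玉県川口市金山町１２丁目１－１０４番地"

# Inside [...], every listed character is a separate one-character option;
# '|' and '"' are literal members of the set, not alternation or quoting.
char_class = re.findall(r'["番地"|"丁目"]', add)

# Outside a character class, '|' separates whole alternatives.
alternation = re.findall(r'番地|丁目', add)

print(char_class)   # ['丁', '目', '番', '地']
print(alternation)  # ['丁目', '番地']
```

Note that with a pattern that has no groups, re.findall returns a plain list of matched strings, so there is nothing extra to unpack.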

To get what you want, use str.replace (unicode.replace in Python 2) and re.sub:

>>> print re.sub(u'番地$', u'', add.replace(u'丁目', u'－'))
埼玉県川口市金山町１２－１－１０４

$ matches only at the end of the string. If 番地 should be removed wherever it appears, not just at the end, a regular expression is not needed; str.replace is enough:

>>> print add.replace(u'丁目', u'－').replace(u'番地', u'')
埼玉県川口市金山町１２－１－１０４
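
If more words need scrubbing later, one possible approach is to drive a single re.sub call from a lookup table. This is only a sketch in Python 3 syntax; REPLACEMENTS, PATTERN and scrub_address are hypothetical names introduced here, and unlike the 番地$ version it removes 番地 anywhere in the string, not only at the end.

```python
import re

# Hypothetical table: each word to scrub mapped to its replacement.
REPLACEMENTS = {"丁目": "－", "番地": ""}

# Build one alternation pattern (e.g. 丁目|番地) from the table keys.
PATTERN = re.compile("|".join(re.escape(word) for word in REPLACEMENTS))

def scrub_address(address):
    """Replace every table word found in the address in a single pass."""
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], address)

add = "埼玉県川口市金山町１２丁目１－１０４番地"
result = scrub_address(add)
print(result)  # 埼玉県川口市金山町１２－１－１０４
```

Adding another word then only means adding an entry to the table; the pattern is rebuilt from its keys.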
