Tensigh

Reputation: 1050

Python - regex with Japanese letters matches only one character

I'm trying to find certain words in Japanese addresses so I can scrub them. If the word is a single character, the regex works fine, but it doesn't seem to find strings that are 2 characters or more:

import re
add = u"埼玉県川口市金山町12丁目1-104番地"

test = re.search(ur'["番地"|"丁目"]',add)
print test.group(0)

丁

I can use re.findall instead of re.search, but it puts all of its findings into a list, so then I have to parse the list. If that's the best way to do it I can live with it, but I figure I'm missing something.

In the example above, I want to swap "丁目" with a dash and remove the trailing "番地", so that the address reads thusly:

埼玉県川口市金山町12-1-104

Upvotes: 1

Views: 1084

Answers (1)

falsetru

Reputation: 369274

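You're using | inside a character class ([...]). A character class matches any single character listed in it, and inside the brackets | and " are just literal characters, so the pattern matches one character at a time; that is not what you want.

Specify the pattern without the character class (and without the " quotes):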

>>> import re
>>> add = u"埼玉県川口市金山町12丁目1-104番地"
>>> test = re.search(ur'番地|丁目', add)
>>> test.group(0)
u'\u4e01\u76ee'
>>> print test.group(0)
丁目
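To see the two patterns side by side, here is a minimal sketch. It assumes Python 3, where every str is Unicode, so the u/ur prefixes used above are dropped; the variable names in_class and alternation are only illustrative. It also shows that re.findall returns a plain list of strings when the pattern contains no groups.

```python
import re

add = "埼玉県川口市金山町12丁目1-104番地"

# Character class: matches ONE character from the set  " 番 地 | 丁 目
in_class = re.findall(r'["番地"|"丁目"]', add)

# Alternation: matches either whole two-character word
alternation = re.findall(r'番地|丁目', add)

print(in_class)     # ['丁', '目', '番', '地']
print(alternation)  # ['丁目', '番地']
```

With the character class, each kanji is reported separately, which is why re.search returned only 丁.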

To get what you want, use str.replace (unicode.replace) and re.sub.

>>> print re.sub(u'番地$', u'', add.replace(u'丁目', u'-'))
埼玉県川口市金山町12-1-104

$ is used to match only at the end of the string. If the position of 番地 does not matter, a regular expression is not needed; str.replace is enough:

>>> print add.replace(u'丁目', u'-').replace(u'番地', u'')
埼玉県川口市金山町12-1-104
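The two variants give the same result for this address, but they can differ when 番地 also appears earlier in the string. The sketch below illustrates that, assuming Python 3 syntax; the sample string is made up purely for the demonstration and is not a real address.

```python
import re

# Hypothetical input in which 番地 appears both at the start and at the end
sample = "番地町1丁目2番地"

step = sample.replace("丁目", "-")      # "番地町1-2番地"

# Anchored regex: only the trailing 番地 is removed
anchored = re.sub("番地$", "", step)

# Plain replace: every occurrence of 番地 is removed
everywhere = step.replace("番地", "")

print(anchored)    # 番地町1-2
print(everywhere)  # 町1-2
```

If 番地 can only ever appear at the end of your addresses, either form works; otherwise the anchored re.sub is the safer choice.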

Upvotes: 5
