Reputation: 1050
I'm trying to find certain words in Japanese addresses so I can scrub them. If the pattern is a single character, the regex works fine, but it doesn't seem to find strings that are two characters or more:
import re
add = u"埼玉県川口市金山町12丁目1-104番地"
test = re.search(ur'["番地"|"丁目"]',add)
print test.group(0)
丁
I can use re.findall instead of re.search, but it puts all of its findings into a tuple, so then I have to parse the tuple. If that's the best way to do it I can live with it, but I figure I'm missing something.
In the example above, I want to swap "丁目" with a dash and remove the trailing "番地", so that the address reads thusly:
埼玉県川口市金山町12-1-104
Upvotes: 1
Views: 1084
Reputation: 369274
You're using | inside a character class ([...]). A character class matches any single character listed in it, and inside the brackets | is just another literal character, not alternation. That is not what you want.
Specify the pattern without the character class (and without the " quotes):
>>> import re
>>> add = u"埼玉県川口市金山町12丁目1-104番地"
>>> test = re.search(ur'番地|丁目', add)
>>> test.group(0)
u'\u4e01\u76ee'
>>> print test.group(0)
丁目
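The difference between the two patterns can be sketched side by side. This is only an illustration and assumes Python 3 (the session above is Python 2), where the u/ur prefixes are unnecessary; the variable names are mine, not from the question. It also shows that re.findall, given a pattern with no groups, returns a plain list of the matched strings.

```python
import re

address = "埼玉県川口市金山町12丁目1-104番地"

# Character class: matches ONE character out of the set ", 番, 地, |, 丁, 目.
char_class = re.compile(r'["番地"|"丁目"]')
# Alternation: matches the whole two-character string 番地 or 丁目.
alternation = re.compile(r'番地|丁目')

class_hit = char_class.search(address).group(0)
alt_hit = alternation.search(address).group(0)
# With no capture groups in the pattern, findall returns a list of strings.
all_hits = alternation.findall(address)

print(class_hit)  # 丁
print(alt_hit)    # 丁目
print(all_hits)   # ['丁目', '番地']
```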
To get what you want, use str.replace (unicode.replace) and re.sub.
>>> print re.sub(u'番地$', u'', add.replace(u'丁目', u'-'))
埼玉県川口市金山町12-1-104
$ is used to match only at the end of the string. If the position of 番地 does not matter, a regular expression is not needed; str.replace is enough:
>>> print add.replace(u'丁目', u'-').replace(u'番地', u'')
埼玉県川口市金山町12-1-104
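If more tokens need scrubbing later, one possible variation is a single re.sub pass driven by a lookup table. This is a rough sketch, assuming Python 3; REPLACEMENTS and scrub_address are hypothetical names I made up for illustration. Unlike the 番地$ version, it replaces 番地 wherever it occurs, not only at the end of the string.

```python
import re

# Hypothetical lookup table: each unwanted token and what to put in its place.
REPLACEMENTS = {"丁目": "-", "番地": ""}

# Build an alternation such as 丁目|番地 from the table keys.
PATTERN = re.compile("|".join(re.escape(token) for token in REPLACEMENTS))

def scrub_address(address):
    """Replace every token from REPLACEMENTS in a single pass."""
    # re.sub accepts a function; it is called with each match object.
    return PATTERN.sub(lambda match: REPLACEMENTS[match.group(0)], address)

cleaned = scrub_address("埼玉県川口市金山町12丁目1-104番地")
print(cleaned)  # 埼玉県川口市金山町12-1-104
```

For just two fixed tokens the chained str.replace calls are simpler; the table only starts to pay off when the list of tokens grows.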
Upvotes: 5