regex returning true with incorrect Japanese characters

Question

I am using a form to check whether a postal code in the Japanese format has been entered. I realized today that some input got through, even though it should never have "passed" the regex matching test.

Here is the regex :

".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"

It includes normal digits and also the Japanese ones ( same for the "-", the Japanese one "ー" can be input also ), the format should be something like : 123-4567.

When only Latin letters and digits are input, it works fine. But some Japanese characters that do not match at all ... are reported as a match :

(Note : a match will return something, no match will return nothing. )

>>> import re
>>> regstr = ".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"

>>> re.match( regstr, "this is obviously not going to work")
>>> re.match( regstr, "this is going to work 123-4567")
<_sre.SRE_Match object at 0x7fced8b485d0>
>>> re.match( regstr, "this is going to work too １２３ー４５６７")
<_sre.SRE_Match object at 0x7fced8b48648>

>>> re.match( regstr, "This will not work, as it should not :  1234-567")
>>> re.match( regstr, "This should not work, but it does :  １２３４ー５６７")
<_sre.SRE_Match object at 0x7fced8b48648>
>>> re.match( regstr, "Now just seems crazy ....... 京都府")
<_sre.SRE_Match object at 0x7fced8b485d0>
>>> re.match( regstr, "京都府")
<_sre.SRE_Match object at 0x7fced8b48648>

>>> "京都府"
'\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c'
>>> re.match( regstr, "\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c")
<_sre.SRE_Match object at 0x7fced8b48648>

I tried inputting Chinese characters, and the couple of characters that I tried did not match.

So anybody living in Kyoto prefecture ... can "bypass" the regex it seems, as "京都府" is enough to make the whole string valid. Using only two of those three characters does not match.

I tried with the unicode codes of those three characters, and it does match too ( I was wondering if the code could be used instead of the character itself for parsing the string, and wanted to make sure that it didn't contain something that could actually fit '000-0000'. It did not, but it still does match the regex ).

People living in Tokyo "東京府" will be 'less' lucky haha :

>>> re.match( regstr, "東京府")
>>> "東京府"
'\xe6\x9d\xb1\xe4\xba\xac\xe5\xba\x9c'

I checked on https://regex101.com/ and those 3 characters do not match there.

So ... I am pretty much lost here. With a simpler ".*([0-9]{3}[-]{1}[0-9]{4}).*" as the regexp, it seems to be fine, but I really do not want to restrict users to inputting only [0-9-], as many will be inputting the Japanese versions ０１２３４５６７８９ー ( which are longer ). If it matters :

# 'Japanese numbers' code
>>> "０１２３４５６７８９ー"
'\xef\xbc\x90\xef\xbc\x91\xef\xbc\x92\xef\xbc\x93\xef\xbc\x94\xef\xbc\x95\xef\xbc\x96\xef\xbc\x97\xef\xbc\x98\xef\xbc\x99\xe3\x83\xbc'

I am going to just convert the Japanese ０１２３４５６７８９ー into 0123456789- for now, and apply a regex not containing Japanese characters at all, but ... I would really like to know what is going on with regex and Japanese characters.
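A minimal sketch of that conversion, assuming the input has already been decoded to a unicode string. The names TO_ASCII, ASCII_POSTAL_RE and normalize_postal_input are hypothetical, chosen only for this illustration:

```python
import re

# Map the fullwidth digits U+FF10..U+FF19 to ASCII 0..9,
# and the long vowel mark U+30FC to an ASCII hyphen.
TO_ASCII = {0xFF10 + i: 0x30 + i for i in range(10)}
TO_ASCII[0x30FC] = 0x2D

# ASCII-only version of the pattern, applied after normalization.
ASCII_POSTAL_RE = re.compile(u".*([0-9]{3}-[0-9]{4}).*")


def normalize_postal_input(text):
    """Return text with fullwidth digits and the long dash replaced by ASCII."""
    return text.translate(TO_ASCII)


fullwidth_code = u"\uff11\uff12\uff13\u30fc\uff14\uff15\uff16\uff17"  # fullwidth 123-4567
normalized = normalize_postal_input(fullwidth_code)
```

Note that this replaces every ー, including the ones inside katakana words, so the normalized string is probably best used only for the matching step.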

If anybody has some clues, that would be well appreciated.

Cheers

edit : python 2.7

zvone · Accepted Answer

regstr = ".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"

In Python 3, regstr would be a unicode string containing some non-ASCII characters. In Python 2, it is a byte string in some encoding, which depends on what you declared at the beginning of the module (see PEP 263) and the encoding actually used to save the file. The regex engine then works byte by byte, so a multi-byte character inside [...] is treated as several separate bytes (and byte ranges) rather than as one character. To avoid such problems, I suggest you never put literal non-ASCII characters in a regex. That is just too difficult to debug. Escape them instead.
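To see why 京都府 matches, it can help to look at the pattern the way Python 2 does, as UTF-8 bytes. The sketch below assumes both the source file and the input are UTF-8 (which the byte dumps in the question suggest); the variable names are only illustrative. It encodes the pattern explicitly so it behaves the same on Python 2 and 3.

```python
import re

# Encoding the pattern to UTF-8 mimics what a Python 2 str literal contains.
pattern_text = u".*([0-9\uff10-\uff19]{3}[-\u30fc]{1}[0-9\uff10-\uff19]{4}).*"
byte_pattern = pattern_text.encode("utf-8")

# As bytes, the digit class reads: 0-9, \xef, \xbc, the RANGE \x90-\xef, \xbc, \x99
# and the dash class reads: -, \xe3, \x83, \xbc
kyoto = u"\u4eac\u90fd\u5e9c".encode("utf-8")  # \xe4\xba\xac \xe9\x83\xbd \xe5\xba\x9c
tokyo = u"\u6771\u4eac\u5e9c".encode("utf-8")  # \xe6\x9d\xb1 \xe4\xba\xac \xe5\xba\x9c

byte_match_kyoto = re.match(byte_pattern, kyoto) is not None
byte_match_tokyo = re.match(byte_pattern, tokyo) is not None
```

In 京都府 the byte \x83 (the middle byte of 都) is accepted as the "dash", with three in-range bytes before it and four after it. 東京府 contains none of the dash bytes, so it does not match.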

The characters ０１２３４５６７８９ are the unicode characters '\uff10' to '\uff19' (and ー is '\u30fc'), so I suggest you write them as such.

Furthermore, if you are using a unicode regex, you should define it as such, using the u prefix for unicode strings:

regstr = u".*([0-9\uff10-\uff19]{3}[-\u30fc]{1}[0-9\uff10-\uff19]{4}).*"
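As a quick check, here is a sketch that runs this escaped pattern against unicode versions of the inputs from the question. The sample names are made up for the illustration; the strings use escapes so the file encoding does not matter.

```python
import re

regstr = u".*([0-9\uff10-\uff19]{3}[-\u30fc]{1}[0-9\uff10-\uff19]{4}).*"

samples = {
    "ascii_ok": u"this is going to work 123-4567",
    "fullwidth_ok": u"\uff11\uff12\uff13\u30fc\uff14\uff15\uff16\uff17",     # 3 digits, dash, 4 digits
    "ascii_wrong_split": u"1234-567",
    "fullwidth_wrong_split": u"\uff11\uff12\uff13\uff14\u30fc\uff15\uff16\uff17",  # 4 digits, dash, 3 digits
    "kyoto": u"\u4eac\u90fd\u5e9c",
}

# True where the pattern matches, False otherwise.
results = {name: re.match(regstr, text) is not None
           for name, text in samples.items()}
```

With both the pattern and the input as unicode, only the two correctly formatted codes match.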

Later, when you are matching this regex to some string, that other string should also be a unicode string, rather than a plain str. For that, you have to know in which encoding the input is. For instance, if the input is utf-8, use:

input_string_as_unicode = unicode(input_string_as_utf8, 'utf-8')
re.match(regstr, input_string_as_unicode)

Note that you possibly already have the input as unicode, if there is some framework behind doing that for you. Check type(input_string) if you are not sure.
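If you cannot be sure which type arrives, one option is to decode only when needed. This is a sketch, not the only way to do it: to_text and has_postal_code are hypothetical helper names, and UTF-8 as the default encoding is an assumption about your input.

```python
import re

POSTAL_RE = re.compile(
    u".*([0-9\uff10-\uff19]{3}[-\u30fc]{1}[0-9\uff10-\uff19]{4}).*")


def to_text(value, encoding="utf-8"):
    """Decode byte strings; return unicode strings unchanged."""
    # In Python 2, bytes is an alias of str, so this check covers plain str there.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def has_postal_code(value, encoding="utf-8"):
    """Return True if value contains something shaped like 123-4567."""
    return POSTAL_RE.match(to_text(value, encoding)) is not None
```

For example, has_postal_code(raw_bytes_from_form) and has_postal_code(already_decoded_text) then go through the same unicode pattern.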
