FlyingPikachu

Reputation: 135

regex returning true with incorrect Japanese characters

I am using a form to check whether a postal code in the Japanese format has been entered. I realized today that some input got through, even though it should never have passed the regex test.

Here is the regex:

".*([0-90-9]{3}[-ー]{1}[0-90-9]{4}).*"

It accepts ASCII digits as well as the full-width Japanese ones (likewise for the "-": the Japanese "ー" can also be entered). The expected format is something like 123-4567.

When only Latin letters and digits are entered, it works fine. But some Japanese strings that should not match at all ... are reported as matches:

(Note : a match will return something, no match will return nothing. )

>>> import re
>>> regstr = ".*([0-90-9]{3}[-ー]{1}[0-90-9]{4}).*"

>>> re.match( regstr, "this is obviously not going to work")
>>> re.match( regstr, "this is going to work 123-4567")
<_sre.SRE_Match object at 0x7fced8b485d0>
>>> re.match( regstr, "this is going to work too 123ー4567")
<_sre.SRE_Match object at 0x7fced8b48648>

>>> re.match( regstr, "This will not work, as it should not :  1234-567")
>>> re.match( regstr, "This should not work, but it does :  1234ー567")
<_sre.SRE_Match object at 0x7fced8b48648>
>>> re.match( regstr, "Now just seems crazy ....... 京都府")
<_sre.SRE_Match object at 0x7fced8b485d0>
>>> re.match( regstr, "京都府")
<_sre.SRE_Match object at 0x7fced8b48648>

>>> "京都府"
'\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c'
>>> re.match( regstr, "\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c")
<_sre.SRE_Match object at 0x7fced8b48648>

I tried entering Chinese characters, and the couple of characters that I tried did not match.

So anybody living in Kyoto prefecture ... can "bypass" the regex, it seems, as "京都府" is enough to make the whole string valid. Only two of those three characters are not enough to produce a match.

I tried with the unicode codes of those three characters, and it does match too ( I was wondering if the code could be used instead of the character itself for parsing the string, and wanted to make sure that it didn't contain something that could actually fit '000-0000'. It did not, but it still does match the regex ).

People living in Tokyo "東京府" will be 'less' lucky haha :

>>> re.match( regstr, "東京府")
>>> "東京府"
'\xe6\x9d\xb1\xe4\xba\xac\xe5\xba\x9c'

I checked on https://regex101.com/ and those 3 characters do not match there.

So ... I am pretty much lost here. With a simpler ".*([0-9]{3}[-]{1}[0-9]{4}).*" as the regex, it seems to be fine, but I really do not want to restrict users to entering only [0-9-], as many will enter the Japanese versions 0123456789ー (which are longer). If it matters:

# 'Japanese numbers' code
>>> "0123456789ー"
'\xef\xbc\x90\xef\xbc\x91\xef\xbc\x92\xef\xbc\x93\xef\xbc\x94\xef\xbc\x95\xef\xbc\x96\xef\xbc\x97\xef\xbc\x98\xef\xbc\x99\xe3\x83\xbc'

For now I am just going to convert the Japanese 0123456789ー into 0123456789- and apply a regex that contains no Japanese characters at all, but ... I would really like to know what is going on with regex and Japanese characters.
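A minimal sketch of that conversion workaround, written for Python 3 (where str is already Unicode). The names TO_ASCII, normalize_postal_input and find_postal_code are made up for illustration, and re.search is used in place of the leading and trailing ".*":

```python
import re

# Map full-width digits (U+FF10..U+FF19) to ASCII digits, and the
# katakana long-vowel mark (U+30FC) to an ASCII hyphen.
TO_ASCII = {0xFF10 + i: ord("0") + i for i in range(10)}
TO_ASCII[0x30FC] = ord("-")

# After normalization the pattern only needs ASCII characters.
ASCII_POSTAL_RE = re.compile(r"[0-9]{3}-[0-9]{4}")


def normalize_postal_input(text):
    """Return text with full-width digits and U+30FC replaced by ASCII."""
    return text.translate(TO_ASCII)


def find_postal_code(text):
    """Return the first NNN-NNNN substring after normalization, or None."""
    match = ASCII_POSTAL_RE.search(normalize_postal_input(text))
    return match.group(0) if match else None
```

Since "ー" also occurs inside ordinary katakana words, this mapping is probably best applied to the postal code field only, not to a whole address.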

If anybody has some clues, that would be well appreciated.

Cheers

edit : python 2.7

Upvotes: 1

Views: 403

Answers (2)

zvone

Reputation: 19362

regstr = ".*([0-90-9]{3}[-ー]{1}[0-90-9]{4}).*"

In Python 3, regstr would be a Unicode string containing some non-ASCII characters. In Python 2, it is a byte string in some encoding, which depends on what you declared at the beginning of the module (see PEP 263) and the encoding actually used to save the file. To avoid such problems, I suggest you never put literal non-ASCII characters in a regex. That is just too difficult to debug. Escape them instead.

The characters 0123456789 are Unicode characters '\uff10' to '\uff19', so I suggest you use them as such.

Furthermore, if you are using a Unicode regex, you should define it as such, using the u prefix for Unicode strings:
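To illustrate what goes wrong at the byte level, here is a small sketch for Python 3 that mimics the Python 2 situation by compiling the UTF-8 bytes of the pattern (this assumes the source file and the input were both UTF-8, as the byte dumps in the question suggest). Inside the character class, the bytes of "0-9" are \xef \xbc \x90 - \xef \xbc \x99, so the regex engine reads a byte range \x90-\xef instead of a range of ten digits, and [-ー] becomes the set of bytes "-", \xe3, \x83, \xbc:

```python
import re

# Same pattern as in the question, with the full-width characters escaped.
PATTERN_TEXT = ".*([0-9\uff10-\uff19]{3}[-\u30fc]{1}[0-9\uff10-\uff19]{4}).*"

# Compiling the UTF-8 bytes mimics what Python 2 does with a plain str literal.
byte_pattern = re.compile(PATTERN_TEXT.encode("utf-8"))

kyoto = "京都府".encode("utf-8")  # b'\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c'
tokyo = "東京府".encode("utf-8")  # b'\xe6\x9d\xb1\xe4\xba\xac\xe5\xba\x9c'

kyoto_match = byte_pattern.match(kyoto)
tokyo_match = byte_pattern.match(tokyo)

print(kyoto_match is not None)  # True
print(tokyo_match is not None)  # False
# The "postal code" that was captured is just a run of UTF-8 bytes:
print(kyoto_match.group(1))     # b'\xba\xac\xe9\x83\xbd\xe5\xba\x9c'
```

The second character of "京都府" contains the byte \x83, which is also one of the bytes of "ー", and every other byte involved falls inside \x90-\xef. "東京府" contains none of the bytes "-", \xe3, \x83 or \xbc, which is why it does not match.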

regstr = u".*([0-9\uff10-\uff19]{3}[-\u30fc]{1}[0-9\uff10-\uff19]{4}).*"

Later, when you are matching this regex to some string, that other string should also be a unicode string, rather than a plain str. For that, you have to know in which encoding the input is. For instance, if the input is utf-8, use:

input_string_as_unicode = unicode(input_string_as_utf8, 'utf-8')
re.match(regstr, input_string_as_unicode)

Note that you possibly already have the input as unicode, if there is some framework behind doing that for you. Check type(input_string) if you are not sure.
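Putting these pieces together, a hedged sketch for Python 3 could look like the following. The helper name contains_postal_code and the UTF-8 default are assumptions for illustration, and re.search stands in for the leading and trailing ".*" of the original pattern:

```python
import re

# Unicode pattern with the full-width digits and U+30FC written as escapes.
POSTAL_RE = re.compile("[0-9\uff10-\uff19]{3}[-\u30fc][0-9\uff10-\uff19]{4}")


def contains_postal_code(value, encoding="utf-8"):
    """Return True if value contains something shaped like 123-4567.

    Byte input is decoded first, so the regex always sees characters
    and never the individual bytes of a multi-byte encoding.
    """
    if isinstance(value, bytes):
        value = value.decode(encoding)
    return POSTAL_RE.search(value) is not None


print(contains_postal_code("this is going to work too 123\u30fc4567"))  # True
print(contains_postal_code("京都府".encode("utf-8")))                    # False
```

Because the match runs on decoded text, each full-width digit is a single character and the range \uff10-\uff19 means what it says.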

Upvotes: 4

accdias

Reputation: 5372

I've just tried your tests on Python 3.6.6 and they worked as expected. The only thing I did differently was to use re.compile. Look:

Python 3.6.6 (default, Jul 19 2018, 14:25:17) 
[GCC 8.1.1 20180712 (Red Hat 8.1.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> zipcode = re.compile(r'.*([0-90-9]{3}[-ー]{1}[0-90-9]{4}).*')
>>> zipcode.match("this is obviously not going to work")
>>> zipcode.match("this is going to work 123-4567")
<_sre.SRE_Match object; span=(0, 30), match='this is going to work 123-4567'>
>>> zipcode.match("this is going to work 123-4567").group(0)
'this is going to work 123-4567'
>>> zipcode.match("this is going to work 123-4567").group(1)
'123-4567'
>>> zipcode.match("this is going to work too 123ー4567").group(1)
'123ー4567'
>>> zipcode.match("This should not work, but it does :  1234ー567").group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> zipcode.match("This should not work, but it does :  1234ー567")
>>> zipcode.match("Now just seems crazy ....... 京都府")
>>> zipcode.match("京都府")
>>> 

EDIT

Here is what I have so far:

$ cat ziptest.py 
# -*- coding: utf-8 -*-
import re
zipcode = re.compile(r'.*([0-90123456789]{3}[-ー]{1}[0-90123456789]{4}).*')
tests = (
    "this is obviously not going to work",
    "this is going to work 123-4567",
    "this is going to work too 123ー4567",
    "This will not work, as it should not :  1234-567",
    "This should not work, but it does :  1234ー567",
    "Now just seems crazy ....... 京都府",
    "京都府",
    "\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c"
)

for test in tests:
    print('%s: %s' % (test, "Match" if zipcode.match(test) else "No match"))
$ 

And here are the results:

$ python2.7 ziptest.py 
this is obviously not going to work: No match
this is going to work 123-4567: Match
this is going to work too 123ー4567: Match
This will not work, as it should not :  1234-567: No match
This should not work, but it does :  1234ー567: Match
Now just seems crazy ....... 京都府: No match
京都府: No match
京都府: No match

$ python3.6 ziptest.py 
this is obviously not going to work: No match
this is going to work 123-4567: Match
this is going to work too 123ー4567: Match
This will not work, as it should not :  1234-567: No match
This should not work, but it does :  1234ー567: No match
Now just seems crazy ....... 京都府: No match
京都府: No match
京é½åº: No match

I hope it helps.

Upvotes: 1
