Reputation: 557
I have a city name in unicode, and I want to match it with regex, but I also want to validate when it is a string, like "New York". I searched a little bit and tried something attached below, but could not figure out how?
I tried this regex "([\u0000-\uFFFF]+)" on this website:http://regex101.com/#python and it works, but could not get it working in python.
Thanks in advance!!
city=u"H\u0101na"
mcity=re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)
mcity.group(0)
u'H'
Upvotes: 1
Views: 926
Reputation: 536369
mcity=re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)
Unlike \x
, \u
is not a special sequence in regex syntax, so your character group matches a literal backslash, letter U, and so on.
To refer to non-ASCII in a regex you have to include them as raw characters in a Unicode string, for example as:
mcity=re.search(u"([\u0000-\uFFFFA-Za-z\\s]+)", city, re.U)
(If you don't want to double-backslash the \s
, you could also use a ur
string, in which \u
still works as an escape but the other escapes like \x
don't. This is a bit confusing though.)
This character group is redundant: including the range U+0000 to U+FFFF already covers all of A-Za-z\s
, and indeed the whole Basic Multilingual Plane including control characters. On a narrow build of Python (including Windows Python 2 builds), where the characters outside the BMP are represented using surrogate pairs in the range U+D800 to U+DFFF, you are actually allowing every single character, so it's not much of a filter. (.+
would be a simpler way of putting it.)
Then again it's pretty difficult to express what might constitute a valid town name in different parts of the world. I'd be tempted to accept anything that, shorn of control characters and leading/trailing whitespace, wasn't an empty string.
Upvotes: 1