python regex with unicode to match a city name

Question

I have a city name in Unicode and want to match it with a regex, but I also want the pattern to accept plain ASCII names such as "New York". I searched a bit and tried the code below, but could not figure out how to make it work.

I tried the regex "([\u0000-\uFFFF]+)" on http://regex101.com/#python and it works there, but I could not get it working in Python.

Thanks in advance!!

>>> import re
>>> city=u"H\u0101na"
>>> mcity=re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)
>>> mcity.group(0)
u'H'

bobince · Accepted Answer

mcity=re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)

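Unlike \x, \u is not a special sequence in Python 2's regex syntax. Because the pattern is a raw string, the regex engine sees the backslash and the u as written, so your character group is built from the literal characters u, 0, F and so on rather than from the code points you intended. That is why only the H is matched.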

To refer to non-ASCII characters in a regex you have to include them as raw characters in a Unicode string, for example:

mcity=re.search(u"([\u0000-\uFFFFA-Za-z\\s]+)", city, re.U)

(If you don't want to double-backslash the \s, you could also use a ur string, in which \u still works as an escape but the other escapes like \x don't. This is a bit confusing though.)

This character group is redundant: including the range U+0000 to U+FFFF already covers all of A-Za-z\s, and indeed the whole Basic Multilingual Plane including control characters. On a narrow build of Python (including Windows Python 2 builds), where the characters outside the BMP are represented using surrogate pairs in the range U+D800 to U+DFFF, you are actually allowing every single character, so it's not much of a filter. (.+ would be a simpler way of putting it.)
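To illustrate how little the BMP-wide range filters, here is a minimal sketch. It assumes Python 3.3 or later, where the re module does interpret \u escapes inside a raw pattern (unlike the Python 2 behaviour described above). The names bmp_only and sample are made up for the illustration.

```python
import re

# Assumption: Python 3.3+, where re understands \uXXXX inside a raw pattern.
bmp_only = re.compile(r"[\u0000-\uFFFF]+")

# A city name followed by NUL, BEL and a newline: all control characters.
sample = "H\u0101na\x00\x07\n"

# fullmatch succeeds only if the whole string is covered by the pattern.
match = bmp_only.fullmatch(sample)
accepted_everything = match is not None
```

The control characters are accepted along with the letters, so the class does not reject anything you would actually want to keep out of a city name.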

Then again it's pretty difficult to express what might constitute a valid town name in different parts of the world. I'd be tempted to accept anything that, shorn of control characters and leading/trailing whitespace, wasn't an empty string.
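That permissive check could be sketched roughly as follows. This is one possible reading, not a definitive validator: it assumes Python 3, treats "control characters" as Unicode general category Cc, and the function name is_plausible_city_name is hypothetical.

```python
import unicodedata


def is_plausible_city_name(name):
    """Return True if something is left after removing control characters
    and leading/trailing whitespace."""
    # Assumption: "control characters" means Unicode general category "Cc"
    # (this includes tab and newline as well as NUL, BEL, etc.).
    cleaned = "".join(ch for ch in name if unicodedata.category(ch) != "Cc")
    # Trim surrounding whitespace and require a non-empty remainder.
    return cleaned.strip() != ""
```

For example, is_plausible_city_name("H\u0101na") and is_plausible_city_name("New York") both return True, while a string made only of spaces or control characters returns False.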
