Reputation: 8187
I'm trying to run a simple regex match to extract the parts of a DMS (degrees, minutes, seconds) coordinate.
The regex seems solid enough: when I copy the pattern and input below into regexr, I get a match:
regex: ^([0-9]{1,2})[:|°| ]([0-9]{1,2})[:|'|′| ]?([0-9]{1,4}(?:\.[0-9]+){0,1})?[\"|″| ]([N|S])$
body: 51°30'30.7080"N
In my Python script I have the following:
# coding: utf8
import re

dms_regex = "^([0-9]{1,2})[:|°| ]([0-9]{1,2})[:|'|′| ]?([0-9]{1,4}(?:\.[0-9]+){0,1})?[\"|″| ]([N|S])$"

def dmsToCoordinates(dms_lat):
    print(dms_lat)
    matches = re.search(dms_regex, dms_lat)
    print(matches)

dmsToCoordinates("51°30'30.7080\"N")
My terminal output is:
51°30'30.7080"N
None
Obviously I expect to get a match. Please let me know what I'm doing wrong before I throw myself out of the window.
Upvotes: 0
Views: 553
Reputation: 35080
As it turned out (mostly by accident) in the comment thread on the question, you were running the script with the wrong (old) version of Python by mistake: Python 2 instead of Python 3.
One of the main problems (if not the main problem) with Python 2 was that string handling was completely broken. The str type was a bytestring, with a separate unicode type for Unicode text. This blurred the line between what's text and what's data.
So when you entered a non-ascii character (such as the degree sign) and ran your code in python 2, this happened:
>>> '°'
'\xc2\xb0'
>>> len('°')
2
>>> '°'.decode('utf8')
u'\xb0'
>>> len('°'.decode('utf8'))
1
The '°' (byte)string literal becomes two bytes of data while still masquerading as a string! It's only a proper single-character string once you decode it and end up with a unicode string. So when you put it into a character class:
>>> 'f[°]oo'
'f[\xc2\xb0]oo'
the two bytes act like two separate characters in the character class, one '\xc2', the other '\xb0'. The class then matches only a single one of those bytes, so the pattern won't match when the full '\xc2\xb0' byte pair (that is, another literal degree sign) appears in the target string:
>>> re.search('f[°]oo', 'f°oo') is None
True
>>> re.search('f[\xc2\xb0]oo', 'f\xc2\xb0oo') is None # exact same thing as previous
True
>>> re.search('f[\xc2\xb0]oo', 'f\xc2oo') is None
False
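The same effect can be reproduced on Python 3 by working with bytes explicitly. This is only a sketch to illustrate the point, not something from the original script: the variable names are made up, and the pattern is built from the UTF-8 encoding of the degree sign so that it mirrors what Python 2 saw.

```python
import re

degree = "°"                      # one character as text
encoded = degree.encode("utf-8")  # two bytes as data: b'\xc2\xb0'

# In a bytes pattern, each byte in the class is its own "character".
pattern = re.compile(b"f[" + encoded + b"]oo")

# The class consumes only one byte, so the two-byte degree sign cannot match.
no_match = pattern.search(b"f" + encoded + b"oo")

# A lone b'\xc2' between "f" and "oo" does match, as in the session above.
one_byte = pattern.search(b"f\xc2oo")

print(len(degree), len(encoded))               # 1 2
print(no_match is None, one_byte is not None)  # True True
```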
So your accidental use of Python 2 caused the regex to break on the non-ASCII characters, which is fundamentally due to how strings are broken in Python 2. Anyone who wanted to run the code in Python 2 would at least have to switch to u'' unicode literals in both the pattern and the target, which fixes the problem because each non-ASCII character is then a single character again.
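For comparison, here is a minimal check of the question's pattern on Python 3, where str is already Unicode text. The only change from the question is the r prefix, which I added on the assumption that you want to avoid invalid-escape warnings on recent Python 3 versions; the pipes are left in place on purpose.

```python
import re

# The question's pattern, unchanged apart from the raw-string prefix.
dms_regex = r"^([0-9]{1,2})[:|°| ]([0-9]{1,2})[:|'|′| ]?([0-9]{1,4}(?:\.[0-9]+){0,1})?[\"|″| ]([N|S])$"

match = re.search(dms_regex, "51°30'30.7080\"N")

# Each non-ASCII character is a single character here, so the classes work.
groups = match.groups() if match else None
print(groups)  # ('51', '30', '30.7080', 'N')
```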
And two general notes about your regex:
1. You should remove all those pipes from your character classes. A character class means "match one character from this bag of characters", meaning that
(a) pipes aren't needed to enforce the choice between characters in the character class, and
(b) a pipe in the class will match an actual pipe in the target: re.search('[a|b]','|') is not None
2. You should use raw string literals for regular expressions; otherwise you occasionally have to double the backslashes in escape sequences to get a proper match and avoid ambiguity.
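Both notes are easy to check in a few lines. The sketch below uses throwaway patterns of my own choosing ("[a|b]" and "\bcat"), not anything from your script, just to show the two effects in isolation.

```python
import re

# Note 1: a pipe inside a character class is just a literal pipe.
pipe_in_class = re.search("[a|b]", "|")  # matches the "|" itself
pipe_without = re.search("[ab]", "|")    # no pipe in the class, no match

# Note 2: without the r prefix, "\b" is a backspace character,
# not the regex word-boundary escape.
plain = re.search("\bcat", "a cat")      # looks for backspace + "cat"
raw = re.search(r"\bcat", "a cat")       # word boundary, then "cat"

print(pipe_in_class is not None, pipe_without is None)  # True True
print(plain is None, raw is not None)                   # True True
```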
So I suggest this pattern instead:
dms_regex = r"^([0-9]{1,2})[:° ]([0-9]{1,2})[:'′ ]?([0-9]{1,4}(?:\.[0-9]+){0,1})?[\"″ ]([NS])$"
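To round this off, here is one way the cleaned-up pattern could be used on Python 3. Treat it as a sketch under assumptions: the function name dms_to_decimal is hypothetical, and the conversion to signed decimal degrees is my guess at what dmsToCoordinates was eventually meant to do, since the question only shows the matching step.

```python
import re

dms_regex = r"^([0-9]{1,2})[:° ]([0-9]{1,2})[:'′ ]?([0-9]{1,4}(?:\.[0-9]+){0,1})?[\"″ ]([NS])$"

def dms_to_decimal(dms_lat):
    """Hypothetical helper: parse a DMS latitude into decimal degrees."""
    match = re.search(dms_regex, dms_lat)
    if match is None:
        return None  # input does not look like a DMS latitude
    degrees, minutes, seconds, hemisphere = match.groups()
    # The seconds group is optional in the pattern, so it may be None.
    value = int(degrees) + int(minutes) / 60 + float(seconds or 0) / 3600
    # Assumption: southern latitudes are reported as negative numbers.
    return -value if hemisphere == "S" else value

print(dms_to_decimal("51°30'30.7080\"N"))
print(dms_to_decimal("51|30'30.7080\"N"))  # None: pipes no longer match
```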
Upvotes: 3