Reputation: 1618
Helping a friend with some inherited code in a proprietary tool.
Neither of us is very familiar with Python or regex.
In the code below, the degF regex matches 2 groups when I test it on Pythex (http://pythex.org/), but in the script re.match returns None. What am I doing wrong?
# This Python file uses the following encoding: utf-8
import os, sys
import re
testString = "Friday: Thundery Shower, Maximum Temperature: 27°C (81°F) Minimum Temperature: 17°C (63°F)"
t = re.match("^([^:]+):\s*([^,]+)", testString)
degF = re.match("^(\d+.F\))", testString)
print t # _sre.SRE_Match object
print t.group(1) # Friday
print t.group(2) # Thundery Shower
print degF # None
# print "Max temp " + degF.group(1)
# print "Min temp " + degF.group(2)
Upvotes: 1
Views: 73
Reputation: 89557
Your string contains characters outside the ASCII range that are encoded with two bytes in UTF-8, but the string isn't defined as a Unicode string, so the character ° is seen as 2 separate characters.
If you want the dot to match ° as a single character, you need to define your string as a Unicode string:
testString = u"Friday: Thundery Shower, Maximum Temperature: 27°C (81°F) Minimum Temperature: 17°C (63°F)"
Then the pattern \d+.F will match the temperature, provided it is applied with re.search rather than anchored at the start of the string.
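A minimal sketch of the difference, assuming Python 3 (where `str` is already Unicode and the Python 2 byte string is modelled with `bytes`). The degree sign is written as `\u00b0` so the snippet does not depend on the file encoding, and the variable names are illustrative only:

```python
import re

# Same sentence as in the question, with the degree sign as an escape.
text = (u"Friday: Thundery Shower, Maximum Temperature: 27\u00b0C (81\u00b0F) "
        u"Minimum Temperature: 17\u00b0C (63\u00b0F)")
raw = text.encode("utf-8")  # what a Python 2 plain string literal holds

degree_len_text = len(u"\u00b0")                   # one character
degree_len_bytes = len(u"\u00b0".encode("utf-8"))  # two bytes

# On bytes, "." consumes one byte, so it cannot cover the 2-byte degree sign.
byte_hits = re.findall(br"(\d+).F\)", raw)

# On Unicode text, "." consumes the whole degree sign.
text_hits = re.findall(r"(\d+).F\)", text)

print(degree_len_text, degree_len_bytes)
print(byte_hits)
print(text_hits)
```

Note that `re.findall` is used here instead of the anchored `re.match` from the question, since the temperatures are not at the start of the string.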
Upvotes: 1
Reputation: 626835
You used . in your pattern to match a degree symbol. However, in a byte string a . matches a single byte, while ° is in fact two bytes long:
print len('°') # => 2
So, you may just use ° instead of . in your degF pattern (or \W+ to match one or more non-word chars, i.e. r"(\d+\W+F)\)"), use re.search everywhere (or re.findall to get every match) and remove the ^ if you do not plan to match only at the start of the string:
degF = re.findall(r"(\d+°F)\)", testString)
print(degF) # => ['81\xc2\xb0F', '63\xc2\xb0F']
See the Python demo
You may move the unescaped ) to right after \d+ to capture only the numbers. You may change \d+ to \d[\d.]* to match float or integer numbers.
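Both variations can be combined in one pattern. The following is only a sketch, assuming Python 3; the sample sentence with decimal values and the names `DEGREE`, `sample` and `fahrenheit` are made up for illustration:

```python
import re

DEGREE = u"\u00b0"  # degree sign, written as an escape to avoid encoding issues

# Hypothetical input with both integer and decimal temperatures.
sample = (u"Max: 27.5" + DEGREE + u"C (81.5" + DEGREE + u"F) "
          u"Min: 17" + DEGREE + u"C (63" + DEGREE + u"F)")

# The group closes right after the number, so only the digits are captured;
# \d[\d.]* accepts "63" as well as "81.5".
pattern = re.compile(r"(\d[\d.]*)" + DEGREE + r"F\)")

fahrenheit = pattern.findall(sample)
max_f, min_f = (float(value) for value in fahrenheit)

print(fahrenheit)
print(max_f, min_f)
```

The character class `[\d.]*` is deliberately loose: it would also accept something like `1.2.3`, so a stricter pattern may be preferable if the input is not trusted.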
Upvotes: 2
Reputation: 12927
Your regexp here begins with ^ (and additionally re.match matches only at the beginning of the string), but your testString doesn't begin with a sequence of digits.
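A small sketch of that difference, assuming Python 3 and a shortened ASCII-only version of the test string so that encoding plays no role:

```python
import re

# Shortened, ASCII-only stand-in for the string in the question.
text = "Friday: Thundery Shower, Maximum Temperature: 27 C (81 F)"

# re.match only tries position 0, where the text starts with "F", not a digit.
anchored = re.match(r"\d+", text)

# re.search scans forward and stops at the first run of digits.
anywhere = re.search(r"\d+", text)

# re.match does succeed when the pattern fits the start of the string.
prefix = re.match(r"[^:]+", text)

print(anchored)
print(anywhere.group(0))
print(prefix.group(0))
```

This is why the first pattern in the question (`t`) works with `re.match` while `degF` does not.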
Upvotes: 2