Rob Cowell

Reputation: 1618

Why is this python regex not matching any groups?

I'm helping a friend with some inherited code in a proprietary tool.

Neither of us is too familiar with Python or regex.

In the code below, the degF regex matches 2 groups when I test it on Pythex (http://pythex.org/), but re.match returns None when I run it. What am I doing wrong?

# This Python file uses the following encoding: utf-8
import os, sys
import re

testString = "Friday: Thundery Shower, Maximum Temperature: 27°C (81°F) Minimum Temperature: 17°C (63°F)"

t = re.match("^([^:]+):\s*([^,]+)", testString)
degF = re.match("^(\d+.F\))", testString)

print t             # _sre.SRE_Match object
print t.group(1)    # Friday
print t.group(2)    # Thundery Shower
print degF          # None

# print "Max temp " + degF.group(1)
# print "Min temp " + degF.group(2)

Upvotes: 1

Views: 73

Answers (3)

Casimir et Hippolyte

Reputation: 89557

Your string contains characters outside the ASCII range that are encoded with two bytes (in UTF-8), but your string isn't defined as a unicode string, so the grapheme ° is seen as 2 separate characters.

If you want the dot to match ° as a single grapheme, you need to define your string as a unicode string:

testString = u"Friday: Thundery Shower, Maximum Temperature: 27°C (81°F) Minimum Temperature: 17°C (63°F)"

Then the pattern \d+.F will match without any problem.
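
To see the difference described above, here is a minimal sketch. It assumes Python 3, where a plain string literal is already Unicode, so the Python 2 byte-string behaviour is reproduced by encoding explicitly. The names text, raw, byte_match and text_match are illustrative only.

```python
import re

text = "Maximum Temperature: 27°C (81°F) Minimum Temperature: 17°C (63°F)"
raw = text.encode("utf-8")  # assumption: stands in for a Python 2 byte string

# On bytes, "." consumes one byte, but "°" is two bytes (0xC2 0xB0),
# so there is never exactly one byte between the digits and "F".
byte_match = re.search(rb"\d+.F", raw)

# On a Unicode string, "°" is a single code point, so "." matches it.
text_match = re.search(r"\d+.F", text)

print(byte_match)           # None
print(text_match.group(0))  # 81°F
```

In Python 2 the same effect comes from the u"" prefix shown above; in Python 3 no prefix is needed.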

Upvotes: 1

Wiktor Stribiżew

Reputation: 626835

You used . in your pattern to match a degree symbol. However, in a byte string a . matches a single byte, while ° is in fact two bytes long in UTF-8:

print len('°') # => 2

So, you may just use ° instead of . in your degF pattern (or \W+ to match one or more non-word chars, i.e. r"(\d+\W+F)\)"), use re.search or re.findall instead of re.match, and remove the ^ if you do not plan to match only at the start of the string:

degF = re.findall(r"(\d+°F)\)", testString)
print(degF)                                 # => ['81\xc2\xb0F', '63\xc2\xb0F']


You may move the unescaped ) to right after \d+ so that the group captures only the number. You may change \d+ to \d[\d.]* to match float or integer numbers.
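
Both adjustments can be combined as in the following sketch. It assumes Python 3 (strings are Unicode, so ° is a single character), and the names test_string, temps_f, max_f and min_f are illustrative rather than taken from the original tool.

```python
import re

test_string = ("Friday: Thundery Shower, Maximum Temperature: 27°C (81°F) "
               "Minimum Temperature: 17°C (63°F)")

# The group closes right after the number, so only the digits are captured;
# \d[\d.]* also accepts values with a decimal point such as 81.5.
temps_f = re.findall(r"(\d[\d.]*)°F\)", test_string)
max_f, min_f = temps_f  # relies on exactly two Fahrenheit values being present

print("Max temp " + max_f)  # Max temp 81
print("Min temp " + min_f)  # Min temp 63
```

With a single capture group, re.findall returns a list of the captured strings, which is why the values can be unpacked directly.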

Upvotes: 2

Błotosmętek

Reputation: 12927

Your regexp here begins with ^ (and additionally re.match matches only at the beginning of string), but your testString doesn't begin with a sequence of digits.
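
A small sketch of that anchoring behaviour, assuming Python 3 and an illustrative sample string that is not from the question:

```python
import re

sample = "Temperature: 81F"  # hypothetical input that does not start with digits

at_start = re.match(r"\d+", sample)    # match() only tries position 0
anchored = re.search(r"^\d+", sample)  # ^ pins search() to the start as well
anywhere = re.search(r"\d+", sample)   # search() without ^ scans the whole string

print(at_start)           # None
print(anchored)           # None
print(anywhere.group(0))  # 81
```

So fixing the degree symbol alone is not enough: the ^ has to go and re.match has to be replaced before the pattern can find a number in the middle of the string.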

Upvotes: 2
