AKarpun

Reputation: 331

Search for multiple RegEx substrings

I've got a SQLite table of sales records in which field 13 holds the shipping price. There are essentially 3 possibilities:

- a price, for example £15.20
- free
- not specified

The problem is that the field does not always contain only these words: for example, it can say "shipping is £15.20" or "shipping free". I need to normalize it to the aforementioned possibilities. I use RegEx:

def correct_shipping(db_data):
    pattern=re.compile("\£(\d+.\d+)") #search for price
    pattern_free=re.compile("free") #search for free shipping
    pattern_not=re.compile("not specified") #search for shipping not specified

    for every_line in db_data:
        try:
            found=pattern.search(every_line[13].replace(',','')).group(1)
        except:
            try:
                found=pattern_free.search(every_line[13]).group()
            except:
                found=pattern_not.search(every_line[13]).group()

        if found:
            query="UPDATE MAINTABLE SET Shipping='"+found+"' WHERE Id="+str(every_line[0])
            db_cursor.execute(query)
    db_connection.commit()

But this code raises an exception:
AttributeError: 'NoneType' object has no attribute 'group'
The first record, whose value is just "5.20", triggers it because none of the patterns matches.
The question is how to search for these strings properly (is try/except necessary at all?), or how to simply ignore the exception when none of the strings is found (though that does not seem like a good solution).

Upvotes: 0

Views: 317

Answers (2)

7stud

Reputation: 48589

Don't search for the pound sign. Search for the numbers, then manually add the pound sign yourself.

import re

strings = [
    "5.20",
    "$5.20",
    "$.50",
    "$5",
    "Shipping is free",
    "Shipping: not specified",
    "free",
    "not specified",
]

pattern = r"""
    \d*                     #A digit 0 or more times 
    [.]?                    #A dot, optional
    \d+                     #A digit, one or more times 
    | free                  #Or the word free
    | not \s+ specified     #Or the phrase "not specified"
"""

regex = re.compile(pattern, flags=re.X)
results = []

for string in strings:
    md = regex.search(string)

    if md:
        match = md.group()
        if re.search(r"\d", match):
            match = "$" + match
        results.append(match)
    else:
        print("Error--no match!")

print(results)

--output:--
['$5.20', '$5.20', '$.50', '$5', 'free', 'not specified', 'free', 'not specified']

Upvotes: 0

abarnert

Reputation: 365597

The first problem is that your code doesn't handle failure correctly. If you want to use functions that return None on no match, you either have to check for None, or handle the AttributeError that results from trying to call group on it.
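
As a minimal sketch of those two options, assuming Python 3 (the helper names `price_or_none` and `price_or_none_eafp` and the simplified price pattern are made up for illustration, not taken from your code):

```python
import re

# Illustrative price pattern; not the asker's exact regex.
PRICE = re.compile(r"£(\d+\.\d+)")

def price_or_none(text):
    """Option 1: check the result of search() before calling group()."""
    match = PRICE.search(text)
    if match is None:
        return None
    return match.group(1)

def price_or_none_eafp(text):
    """Option 2: call group() anyway and handle the AttributeError."""
    try:
        return PRICE.search(text).group(1)
    except AttributeError:  # search() returned None
        return None
```

Both return the same results; the first is usually easier to read once more than one pattern is involved.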

You could just layer one more try/except under the first two. But this gets very hard to read. A function like this will be a lot simpler:

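def find_shipping(every_line):  # the function name is only illustrative
    match = pattern.search(every_line[13].replace(',',''))
    if match:
        return match.group(1)
    match = pattern_free.search(every_line[13])
    if match:
        return match.group()
    match = pattern_not.search(every_line[13])
    if match:
        return match.group()
    return None  # none of the three patterns matched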

This uses the same regexps as your code, but doesn't have the problem of trying to call group whether or not each match succeeds, so it works, nice and simply.


There are ways you could simplify this further. For example, you don't need to use regexps to search for fixed strings like "free"; you can just use str.find or str.index. Or, alternatively, you could use search with a single regexp with a three-way alternation in it, instead of doing three separate searches.
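
A rough sketch of the single-regexp variant, assuming Python 3 (`SHIPPING`, `normalize_shipping` and `is_free` are hypothetical names, and the `in` operator stands in for `str.find` as the plain-string test):

```python
import re

# One pattern, three alternatives. Only the price alternative has a capture group.
SHIPPING = re.compile(r"£(\d+\.\d+)|free|not specified")

def normalize_shipping(text):
    """Return a price like '15.20', 'free', 'not specified', or None."""
    match = SHIPPING.search(text.replace(",", ""))
    if match is None:
        return None
    # group(1) is set only when the price alternative matched.
    return match.group(1) or match.group()

def is_free(text):
    """Fixed strings need no regexp; a substring test is enough."""
    return "free" in text
```

One difference worth keeping in mind: an alternation returns whichever alternative matches leftmost in the text, whereas three separate searches in sequence prefer the price wherever it appears.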


The next problem is that your first pattern is wrong. You shouldn't be backslash-escaping anything but regexp special characters (or Python special characters… but you should be using raw strings so you don't need to escape those), and the pound sign isn't one of them.
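
For illustration, the price pattern could look something like this as a raw string in Python 3. Escaping the dot is an extra tweak beyond what is described above, added so that `.` matches only a literal decimal point:

```python
import re

# Raw string, no backslash before the pound sign, and an escaped dot
# so that "." cannot match an arbitrary character.
pattern = re.compile(r"£(\d+\.\d+)")

match = pattern.search("shipping is £15.20")
price = match.group(1) if match else None
```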

More importantly, if this is Python 2.x, you should never, ever put non-ASCII characters into string literals; only put them in Unicode literals. (And only if you specify an encoding for the source file.)

Python's regexp engine can handle Unicode… but not if you give it mojibake, like a UTF-8 pound sign decoded as Latin-1 or something. (In fact, even if you get all the encoding right, it's better to give it Unicode patterns and search strings than encoded ones. Otherwise, it has no way of knowing it's searching for Unicode, or that some of the characters are more than a byte long, etc.)
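
A small demonstration of both points, assuming Python 3, where `str` is already Unicode (all variable names here are illustrative):

```python
import re

pound = "£"                             # one Unicode character
encoded = pound.encode("utf-8")         # two bytes: b'\xc2\xa3'
mojibake = encoded.decode("latin-1")    # two characters: 'Â£'

good_pattern = re.compile(pound + r"(\d+\.\d+)")
bad_pattern = re.compile(mojibake + r"(\d+\.\d+)")

text = "shipping is £15.20"
good = good_pattern.search(text)        # finds the price
bad = bad_pattern.search(text)          # None: the text contains no 'Â'

# With encoded bytes, a character class sees two separate bytes,
# not one character.
byte_hits = re.findall(b"[" + encoded + b"]", text.encode("utf-8"))
char_hits = re.findall("[£]", text)
```

Here `byte_hits` contains two one-byte matches for a single pound sign, while `char_hits` contains the one character you would expect.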

Upvotes: 2
