Reputation: 157

Grouping data with a regex in Python

I have some raw data like this:

Dear   John    Buy   1 of Coke, cost 10 dollars
       Ivan    Buy  20 of Milk
Dear   Tina    Buy  10 of Coke, cost 100 dollars
       Mary    Buy   5 of Milk

The rules for the data are:

Not every line starts with "Dear", but when one does, it must end with the cost.
The item is not always an ordinary word; it can be any string (letters, digits, other characters, etc.).

I want to group the information, and I tried to use a regex. This is what I tried:

import re

# `file` is an already-open text file containing the raw data above
for line in file.readlines():
    match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>\w+)(?:\D+)(?P<costs>\d*)', line)
    if match is not None:
        print(match.groups())
file.close()

Now the output looks like:

('John', '1', 'Coke', '10')
('Ivan', '20', 'Milk', '')
('Tina', '10', 'Coke', '100')
('Mary', '5', 'Milk', '')

The output above is what I want. However, if the item is replaced by an unusual string like A1~A10, some of the outputs contain wrong information:

('Ivan', '20', 'A1', '10')
('Mary', '5', 'A1', '10')

I think the one constant in the item field is that it always ends with a comma (when anything follows it). But I don't know how to take advantage of that.

Although the code above works for now, I thought (?P<item>\w+) had to be replaced with something like (?P<item>.+). If I do that, the wrong string ends up in the tuple:

('John', '1', 'Coke, cost 10 dollars', '')

How can I read the data into the format I want using a regex in Python?

Upvotes: 8

Answers (4)

Juan Diego Godoy Robles

Reputation: 14955

I would use this regex:

r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?'

Demo

>>> line = 'Dear   Tina    Buy  10 of A1~A10'
>>> match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?', line)
>>> match.groups()
('Tina', '10', 'A1~A10', None)

>>> line = 'Dear   Tina    Buy  10 of A1~A10, cost 100 dollars'
>>> match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?', line)
>>> match.groups()
('Tina', '10', 'A1~A10', '100')

Explanation

The first section of your regex is perfectly fine. Here's the tricky part:

(?P<item>[^,]+) Since the string always contains a comma when the cost is present, we take everything up to that comma (any character except a comma) as the item value.

(?:,\D+)?(?P<costs>\d+)? Here we're using two groups. The important thing is the ? after the closing parenthesis of each group:

'?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.

So we use ? to match both possibilities (with the cost string present or not).

(?:,\D+) is a non-capturing group that matches a comma followed by one or more non-digit characters.

(?P<costs>\d+) captures one or more digits in the named group costs.
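
As a quick check, the pattern can be run over all four sample lines (two of them with the item swapped for A1~A10). This is only a minimal sketch: the names PATTERN and parse are made up for illustration, and the rstrip is an added assumption for lines read from a file, since [^,]+ would otherwise also capture the trailing newline.

```python
import re

# Pattern from this answer, compiled once for reuse.
PATTERN = re.compile(
    r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?'
)

def parse(line):
    """Return a dict of the named groups, or None if the line does not match."""
    # Strip the newline first: [^,]+ matches '\n' too.
    match = PATTERN.search(line.rstrip('\n'))
    return match.groupdict() if match else None

lines = [
    'Dear   John    Buy   1 of A1~A10, cost 10 dollars',
    '       Ivan    Buy  20 of Milk',
    'Dear   Tina    Buy  10 of Coke, cost 100 dollars',
    '       Mary    Buy   5 of A1~A10',
]
rows = [parse(line) for line in lines]
for row in rows:
    print(row)
```

Lines without a cost come back with costs set to None rather than an empty string, because the whole costs group is optional.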

Upvotes: 5

Wiktor Stribiżew

Reputation: 626927

If you use .+, the subpattern will grab the whole rest of the line, because without the re.S flag . matches any character except a newline.

You can replace the \w+ with a negated character class subpattern [^,]+ to match one or more characters other than a comma:

r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)\D*(?P<costs>\d*)'
                                                ^^^^^

See the IDEONE demo:

import re
file = "Dear   John    Buy   1 of A1~A10, cost 10 dollars\n       Ivan    Buy  20 of Milk\nDear   Tina    Buy  10 of Coke, cost 100 dollars\n       Mary    Buy   5 of Milk"
for line in file.split("\n"):
    match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)\D*(?P<costs>\d*)', line)
    if match:
        print(match.groups())

Output:

('John', '1', 'A1~A10', '10')
('Ivan', '20', 'Milk', '')
('Tina', '10', 'Coke', '100')
('Mary', '5', 'Milk', '')
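
To see why the negated class helps, the three candidate item subpatterns (\w+, .+ and [^,]+) can be compared side by side on one line. A small sketch only; the helper name item_for is illustrative and not part of the answer.

```python
import re

line = 'Dear   John    Buy   1 of A1~A10, cost 10 dollars'

def item_for(item_subpattern):
    """Build the full pattern around one item subpattern; return (item, costs)."""
    pattern = (r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>'
               + item_subpattern
               + r')\D*(?P<costs>\d*)')
    match = re.search(pattern, line)
    return match.group('item', 'costs')

word_chars = item_for(r'\w+')    # stops at '~', so costs is read from inside the item
anything = item_for(r'.+')       # greedy: swallows the rest of the line
not_comma = item_for(r'[^,]+')   # stops at the first comma

print(word_chars)   # ('A1', '10')
print(anything)     # ('A1~A10, cost 10 dollars', '')
print(not_comma)    # ('A1~A10', '10')
```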

Upvotes: 3

saikumarm

Reputation: 1575

I have tried this regular expression:

^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?

Explanation

^(Dear)? matches "Dear" at the start of the line, if it is present
(?P<name>\w*) a named capture group for the name
\D* matches any non-digit characters
(?P<num>\d+) a named capture group for the number
\sof\s matches the string " of " (with whitespace on both sides)
(?P<drink>\w*) a named capture group for the drink
(,\D*(?P<cost>\d+)\D*)? an optional group that captures the cost of the drink

with

>>> reobject = re.compile(r'^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?')

First data snippet

>>> data1 = 'Dear   John    Buy   1 of Coke, cost 10 dollars'
>>> match_object = reobject.search(data1)
>>> match_object.group('name', 'num', 'drink', 'cost')
('John', '1', 'Coke', '10')

Second data snippet

>>> data2 = '       Ivan    Buy  20 of Milk'
>>> match_object = reobject.search(data2)
>>> match_object.group('name', 'num', 'drink', 'cost')
('Ivan', '20', 'Milk', None)
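
The piece-by-piece explanation can also be written straight into the pattern with re.VERBOSE. This is just a sketch of the same regex laid out with comments; the name ORDER is made up here. Note that because the drink group uses \w*, an item such as A1~A10 is cut off at the ~; swapping in [^,]* would be one way around that.

```python
import re

# Same pattern as above; re.VERBOSE ignores the layout whitespace and the
# trailing "#" comments, so each piece can carry its own explanation.
ORDER = re.compile(r'''
    ^(Dear)?                  # optional greeting at the start of the line
    \s*(?P<name>\w*)          # customer name
    \D*                       # skip non-digits ("Buy" and padding)
    (?P<num>\d+)              # quantity
    \sof\s                    # the word "of" with whitespace on both sides
    (?P<drink>\w*)            # item, word characters only
    (,\D*(?P<cost>\d+)\D*)?   # optional tail holding the cost
''', re.VERBOSE)

m = ORDER.search('Dear   John    Buy   1 of Coke, cost 10 dollars')
first = m.group('name', 'num', 'drink', 'cost')

m = ORDER.search('       Ivan    Buy  20 of A1~A10')
second = m.group('name', 'num', 'drink', 'cost')

print(first)    # ('John', '1', 'Coke', '10')
print(second)   # ('Ivan', '20', 'A1', None)
```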

Upvotes: 5

Casimir et Hippolyte

Reputation: 89567

Without regex:

with open('commandes.txt') as f:
    results = []
    for line in f:
        parts = line.split(None, 5)
        price = ''
        if parts[0] == 'Dear':
            tmp = parts[5].split(',', 1)
            for tok in tmp[1].split():
                if tok.isnumeric():
                    price = tok
                    break 
            results.append((parts[1], parts[3], tmp[0], price))
        else:
            results.append((parts[0], parts[2], parts[4].split(',')[0], price))
    print(results)

It doesn't care which characters are used, apart from whitespace, up to the product name; that's why each line is split on whitespace at most 5 times (giving up to 6 parts). When the line starts with "Dear", the last part is split at the first comma to extract the product name and the price. Note that if the price is always in the same place (i.e. right after "cost"), you can drop the innermost for loop and replace it with price = tmp[1].split()[1]

Note: if you want to prevent empty lines from being processed, you can change the first for loop to:

for line in (x for x in f if x.rstrip()):
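
Putting both notes together, here is a rough sketch of the simplified version. It assumes the price always comes right after "cost", and it reads from an in-memory list instead of a file; the names parse_orders and sample are made up for the example.

```python
def parse_orders(lines):
    """Split-based parsing, assuming the price always follows the word 'cost'."""
    results = []
    for line in (x for x in lines if x.rstrip()):    # skip blank lines
        parts = line.split(None, 5)                  # at most 6 parts
        if parts[0] == 'Dear':
            item, rest = parts[5].split(',', 1)      # 'A1~A10', ' cost 10 dollars'
            price = rest.split()[1]                  # second token after the comma
            results.append((parts[1], parts[3], item, price))
        else:
            results.append((parts[0], parts[2], parts[4].split(',')[0], ''))
    return results

sample = [
    'Dear   John    Buy   1 of A1~A10, cost 10 dollars\n',
    '\n',
    '       Ivan    Buy  20 of Milk\n',
]
orders = parse_orders(sample)
print(orders)   # [('John', '1', 'A1~A10', '10'), ('Ivan', '20', 'Milk', '')]
```

One limitation to keep in mind: on lines without "Dear", an item containing spaces would be cut at the first space, since only parts[4] is used.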

Upvotes: 5
