Reputation: 157
I have some raw data like this:
Dear John Buy 1 of Coke, cost 10 dollars
Ivan Buy 20 of Milk
Dear Tina Buy 10 of Coke, cost 100 dollars
Mary Buy 5 of Milk
The rules of the data are:
Not every line starts with "Dear", but if it does, the line must end with the cost.
The item is not always a normal word; it can be written without restrictions (letters, digits, other characters, etc.).
I want to group the information, and I tried to use a regex. This is what I tried:
for line in file.readlines():
    match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>\w+)(?:\D+)(?P<costs>\d*)',line)
    if match is not None:
        print(match.groups())
file.close()
Now the output looks like:
('John', '1', 'Coke', '10')
('Ivan', '20', 'Milk', '')
('Tina', '10', 'Coke', '100')
('Mary', '5', 'Milk', '')
The output above is what I want. However, if the item is replaced by an unusual string like A1~A10, some of the outputs contain wrong information:
('Ivan', '20', 'A1', '10')
('Mary', '5', 'A1', '10')
I think the one constant in the item field is that it always ends with a comma (if anything follows it), but I don't know how to take advantage of that.
Although the code above works for the simple data, I thought (?P<item>\w+) had to be replaced with something like (?P<item>.+). If I do that, the wrong string ends up in the tuple:
('John', '1', 'Coke, cost 10 dollars', '')
How could I read the data into the format I want by using the regex in Python?
Upvotes: 8
Views: 572
Reputation: 14955
I would use this regex:
r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?'
Demo
>>> line = 'Dear Tina Buy 10 of A1~A10'
>>> match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?', line)
>>> match.groups()
('Tina', '10', 'A1~A10', None)
>>> line = 'Dear Tina Buy 10 of A1~A10, cost 100 dollars'
>>> match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?', line)
>>> match.groups()
('Tina', '10', 'A1~A10', '100')
Explanation
The first section of your regex is perfectly fine, here’s the tricky part:
(?P<item>[^,]+)
Since we're sure that the string contains a comma whenever the cost string is present, we say here that the item value is anything but a comma.
(?:,\D+)?(?P<costs>\d+)?
Here we're using two groups. The important thing is the ? after the parentheses enclosing each group:
'?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
So we use ? to match both possibilities (with the cost string present or not).
(?:,\D+) is a non-capturing group that matches a comma followed by one or more non-digit characters.
(?P<costs>\d+) captures one or more digits in the named group costs.
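As a minimal sketch of how the pattern above could be wrapped for reuse: the helper name parse_line and the sample list are made up for illustration, and the line without "Dear" is assumed to start with whitespace (as in the other demos in this thread), because the pattern begins with \s+.

```python
import re

# Pattern from this answer, compiled once.
PATTERN = re.compile(
    r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)(?:,\D+)?(?P<costs>\d+)?'
)

def parse_line(line):
    """Hypothetical helper: return the named groups as a dict, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # default='' replaces None for the optional costs group when it did not match
    return match.groupdict(default='')

# Sample lines adapted from the question.
lines = [
    'Dear John Buy 1 of A1~A10, cost 10 dollars',
    ' Ivan Buy 20 of Milk',
]
rows = [parse_line(line) for line in lines]
print(rows)
```

Note that [^,]+ also matches a newline, so lines read directly from a file are probably best passed through rstrip() first; otherwise the item of a line without a cost would keep its trailing '\n'.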
Upvotes: 5
Reputation: 626927
If you use .+, the subpattern will grab the whole rest of the line, as . matches any character but a newline without the re.S flag.
You can replace the \w+ with a negated character class subpattern [^,]+ to match one or more characters other than a comma:
r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)\D*(?P<costs>\d*)'
                                                ^^^^^
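To see the difference in isolation, here is a small sketch that keeps only the tail of the pattern (from "of" onwards); the shortened patterns and the variable names greedy and bounded are just for illustration.

```python
import re

line = 'Dear John Buy 1 of Coke, cost 10 dollars'

# Greedy .+ consumes the rest of the line; the parts after it (\D* and \d*)
# can both match the empty string, so nothing forces the engine to backtrack.
greedy = re.search(r'\sof\s(?P<item>.+)\D*(?P<costs>\d*)', line)

# [^,]+ cannot cross the comma, so the cost is left for the later groups.
bounded = re.search(r'\sof\s(?P<item>[^,]+)\D*(?P<costs>\d*)', line)

print(greedy.group('item', 'costs'))   # ('Coke, cost 10 dollars', '')
print(bounded.group('item', 'costs'))  # ('Coke', '10')
```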
See the IDEONE demo:
import re
file = "Dear John Buy 1 of A1~A10, cost 10 dollars\n Ivan Buy 20 of Milk\nDear Tina Buy 10 of Coke, cost 100 dollars\n Mary Buy 5 of Milk"
for line in file.split("\n"):
    match = re.search(r'\s+(?P<name>\w+)\D*(?P<num>\d+)\sof\s(?P<item>[^,]+)\D*(?P<costs>\d*)',line)
    if match:
        print(match.groups())
Output:
('John', '1', 'A1~A10', '10')
('Ivan', '20', 'Milk', '')
('Tina', '10', 'Coke', '100')
('Mary', '5', 'Milk', '')
Upvotes: 3
Reputation: 1575
I have tried this regular expression
^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?
Explanation
^(Dear)? matches a line that optionally starts with Dear.
(?P<name>\w*) is a named capture group that captures the name.
\D* matches any non-digit characters.
(?P<num>\d+) is a named capture group that gets the num.
\sof\s matches the string "of" surrounded by whitespace.
(?P<drink>\w*) gets the drink.
(,\D*(?P<cost>\d+)\D*)? is an optional group that gets the cost of the drink.
>>> reobject = re.compile(r'^(Dear)?\s*(?P<name>\w*)\D*(?P<num>\d+)\sof\s(?P<drink>\w*)(,\D*(?P<cost>\d+)\D*)?')
First data snippet
>>> data1 = 'Dear John Buy 1 of Coke, cost 10 dollars'
>>> match_object = reobject.search(data1)
>>> print(match_object.group('name', 'num', 'drink', 'cost'))
('John', '1', 'Coke', '10')
Second data snippet
>>> data2 = ' Ivan Buy 20 of Milk'
>>> match_object = reobject.search(data2)
>>> print(match_object.group('name', 'num', 'drink', 'cost'))
('Ivan', '20', 'Milk', None)
Upvotes: 5
Reputation: 89567
Without regex:
with open('commandes.txt') as f:
    results = []
    for line in f:
        parts = line.split(None, 5)
        price = ''
        if parts[0] == 'Dear':
            tmp = parts[5].split(',', 1)
            for tok in tmp[1].split():
                if tok.isnumeric():
                    price = tok
                    break
            results.append((parts[1], parts[3], tmp[0], price))
        else:
            results.append((parts[0], parts[2], parts[4].split(',')[0], price))
print(results)
It doesn't care which characters are used, apart from whitespace, up to the product name; that's why each line is split on whitespace at most 5 times. When the line starts with "Dear", the last part is split at the comma to extract the product name and the price. Note that if the price is always in the same place (i.e. right after "cost"), you can drop the innermost for loop and replace it with price = tmp[1].split()[1]
Note: if you want to prevent empty lines from being processed, you can change the first for loop to:
for line in (x for x in f if x.rstrip()):
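Putting both notes together, a rough sketch could look like the following. The function name parse_orders and the in-memory sample list are made up for illustration (a file object could be passed instead), and it assumes the price is always the token right after "cost".

```python
def parse_orders(lines):
    """Split-based parsing; assumes the price is the token right after 'cost'."""
    results = []
    # Skip blank lines before splitting
    for line in (x for x in lines if x.rstrip()):
        parts = line.split(None, 5)
        if parts[0] == 'Dear':
            item, rest = parts[5].split(',', 1)
            price = rest.split()[1]  # rest looks like ' cost 10 dollars'
            results.append((parts[1], parts[3], item, price))
        else:
            results.append((parts[0], parts[2], parts[4].split(',')[0], ''))
    return results

sample = [
    'Dear John Buy 1 of Coke, cost 10 dollars\n',
    'Ivan Buy 20 of Milk\n',
    '\n',
    'Dear Tina Buy 10 of A1~A10, cost 100 dollars\n',
]
print(parse_orders(sample))
```

Like the original loop, this still assumes the item itself contains no whitespace on lines without "Dear".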
Upvotes: 5