Reputation: 8215
I've got a file that looks like this:
useless stuff
fruit: apple
fruit: banana
useless stuff
fruit: kiwi
fruit: orange
fruit: pear
useless stuff
The idea is to capture all the fruit names in the order they appear, grouped by runs of consecutive lines. For the example above, the output should be something like:
[['apple', 'banana'], ['kiwi', 'orange', 'pear']]
I managed to do this by iterating over all the matches of the multiline regexp '^fruit: (.+)$', adding fruit names to the same list whenever the lines they were found on follow each other.
However, this is impractical for doing substitutions on the fruit names (it then becomes mandatory to keep track of each match's start and end index), so I would prefer to do this in a single regexp.
I've tried this:
re.findall(r'(?:^fruit: (.+)$\n)+', thetext, re.M)
But it only returns one line.
Where am I going wrong?
Upvotes: 2
Views: 175
Reputation: 215059
You cannot do "grouping" this way in regular expressions, because normally a group captures only its latest match. A workaround would be to repeat a group literally:
matches = re.findall(r'(?m)(?:^fruit: (.+)\n)(?:^fruit: (.+)\n)?(?:^fruit: (.+)\n)?', text)
# [('apple', 'banana', ''), ('kiwi', 'orange', 'pear')]
If this is appropriate to your task (say, no more than 5-6 repeated groups, i.e. fruit lines per run), you can easily generate such expressions on the fly. If not, the only option is a two-pass match (I guess this is similar to what you already have):
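As a rough sketch of generating such an expression on the fly: the helper name build_pattern and the limit of 5 lines per run are assumptions made for illustration, not part of the original answer. Note that a run longer than the limit would be split across several matches.

```python
import re

def build_pattern(max_lines):
    # Hypothetical helper: one mandatory "fruit:" line followed by up to
    # max_lines - 1 optional ones, each with its own capture group.
    line = r'(?:^fruit: (.+)\n)'
    return '(?m)' + line + (line + '?') * (max_lines - 1)

text = (
    "useless stuff\n"
    "fruit: apple\n"
    "fruit: banana\n"
    "useless stuff\n"
    "fruit: kiwi\n"
    "fruit: orange\n"
    "fruit: pear\n"
    "useless stuff\n"
)

pattern = build_pattern(5)
# Optional groups that did not take part in a match come back as '',
# so filter those out of each tuple.
groups = [[name for name in m if name] for m in re.findall(pattern, text)]
print(groups)  # [['apple', 'banana'], ['kiwi', 'orange', 'pear']]
```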
matches = [re.findall(': (.+)', x)
           for x in re.findall(r'(?m)((?:^fruit: .+\n)+)', text)]
# [['apple', 'banana'], ['kiwi', 'orange', 'pear']]
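Since the question is ultimately about substitutions, the same two-pass idea can be combined with re.sub and a replacement function, so that no start or end offsets need to be tracked by hand. This is only a sketch: the function name number_block and the "(i/n)" suffix are invented for the example.

```python
import re

text = (
    "useless stuff\n"
    "fruit: apple\n"
    "fruit: banana\n"
    "useless stuff\n"
    "fruit: kiwi\n"
    "fruit: orange\n"
    "fruit: pear\n"
    "useless stuff\n"
)

def number_block(block):
    # block.group(0) is one whole run of consecutive "fruit:" lines.
    lines = block.group(0).splitlines()
    total = len(lines)
    # Second pass: rewrite each line knowing its position within the run.
    return ''.join('%s (%d/%d)\n' % (line, i, total)
                   for i, line in enumerate(lines, 1))

# First pass: find each run; re.sub splices the rewritten run back in place.
new_text = re.sub(r'(?m)(?:^fruit: .+\n)+', number_block, text)
print(new_text)
```

Lines outside the runs are left untouched, and each fruit line gets its position within its own group appended.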
The third-party regex module (not yet part of the standard library) provides an interesting method called "captures". m.captures(n) returns all matches for a group, not only the latest one, like m.group(n) does:
import regex
matches = [x.captures(2) for x in regex.finditer(r'(?m)((?:^fruit: (.+)\n)+)', text)]
# [['apple', 'banana'], ['kiwi', 'orange', 'pear']]
Upvotes: 1
Reputation: 98118
Another way:
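import re
with open('input') as f:
    text = f.read()
fruits = [[]]
# Match every line; the captured name is '' for lines that are not "fruit: ..."
for fruit in re.findall(r'^fruit: (.*)$|^.*$', text, re.M):
    if fruit == '':
        # a non-fruit line closes the current group, if it has any entries
        if len(fruits[-1]) > 0:
            fruits.append([])
    else:
        fruits[-1].append(fruit)
# drop the trailing empty group, if there is one
if not fruits[-1]:
    del fruits[-1]
print(fruits)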
Output
[['apple', 'banana'], ['kiwi', 'orange', 'pear']]
Upvotes: 1
Reputation: 133764
This allows you to keep your regex, as you said you may need more complex expressions later:
>>> import re
>>> from itertools import groupby
>>> with open('test.txt') as fin:
...     # group consecutive lines based on whether each line matched
...     groups = groupby((re.match(r'(?:fruit: )(.+)', line) for line in fin),
...                      key=bool)
...     # print the captured names of each run of matching lines
...     print([[m.group(1) for m in g] for k, g in groups if k])
...
[['apple', 'banana'], ['kiwi', 'orange', 'pear']]
Without regex:
>>> with open('test.txt') as f:
...     print([[x.split()[1] for x in g]
...            for k, g in groupby(f, key=lambda s: s.startswith('fruit'))
...            if k])
...
[['apple', 'banana'], ['kiwi', 'orange', 'pear']]
Upvotes: 1
Reputation: 474
How about:
re.findall(r'fruit: (\w+)\n|[^\n]*\n', text)
The result looks like this, with an empty string for each non-fruit line:
['', '', 'apple', 'banana', '', '', '', 'kiwi', 'orange', 'pear', '']
This can easily be converted to [['apple', 'banana'], ['kiwi', 'orange', 'pear']].
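One possible way to do that conversion, as a sketch: itertools.groupby with key=bool splits the flat list into alternating runs of empty and non-empty strings, and only the non-empty runs are kept. The variable name flat is just for illustration.

```python
from itertools import groupby

flat = ['', '', 'apple', 'banana', '', '', '', 'kiwi', 'orange', 'pear', '']

# bool('') is False, so consecutive fruit names form the truthy runs.
groups = [list(run) for is_fruit, run in groupby(flat, key=bool) if is_fruit]
print(groups)  # [['apple', 'banana'], ['kiwi', 'orange', 'pear']]
```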
Upvotes: 0
Reputation: 22623
I think you will see the problem if you make the inner group non-capturing like so:
re.findall(r'(?:^fruit: (?:.+)$\n)+', thetext, re.M)
# result:
['fruit: apple\nfruit: banana\n', 'fruit: kiwi\nfruit: orange\nfruit: pear\n']
The problem is that each match covers an entire run of fruit: lines, but the capturing group (in your original solution) captures multiple times. Since a capture group can hold only one value, it ends up with the last captured substring. (This is documented behavior of Python's re module: when a group matches several times, only the last match is accessible.)
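To see this on the sample from the question, here is a small check; the text is inlined in a variable rather than read from a file, which is an assumption made to keep the snippet self-contained.

```python
import re

thetext = (
    "useless stuff\n"
    "fruit: apple\n"
    "fruit: banana\n"
    "useless stuff\n"
    "fruit: kiwi\n"
    "fruit: orange\n"
    "fruit: pear\n"
    "useless stuff\n"
)

pattern = r'(?:^fruit: (.+)$\n)+'

# findall returns group 1 of each match, i.e. only the last capture per run.
last_only = re.findall(pattern, thetext, re.M)
print(last_only)  # ['banana', 'pear']

# The match itself spans the whole run, but group(1) holds one value.
m = re.search(pattern, thetext, re.M)
print(repr(m.group(0)))  # 'fruit: apple\nfruit: banana\n'
print(m.group(1))        # banana
```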
Upvotes: 1
Reputation: 4128
I'm not a big fan of using regex unless you absolutely have to. Taking a step back and looking at your case, my first inclination is to ask whether you shouldn't in fact be massaging the input files into something like CSV with a specialised tool like awk before feeding them into Python.
Having said that, you can of course still accomplish what you're looking to do using clear, regex-free Python. An example (which I'm sure can be reduced without sacrificing transparency):
# result is the end result list of lists
result = []
# lst is the sublist which gets reset every time a grouping concludes
lst = []
with open('input.txt') as f:
    for line in f:
        # is the first token NOT a fruit?
        if line.split(':')[0] != 'fruit':
            # the current group (if any) has ended; store it,
            # but don't append needless empty sublists
            if len(lst) > 0:
                result.append(lst)
            # initialise a new sublist for the next group
            lst = []
        else:
            # first token IS a fruit, so append its name to the sublist
            lst.append(line.split()[1])
# store the last group if the file ends with a fruit line
if len(lst) > 0:
    result.append(lst)
print(result)
Upvotes: 0