michaelmeyer

Reputation: 8215

Multiline regex matching

I've got a file that looks like this:

useless stuff

fruit: apple
fruit: banana

useless stuff

fruit: kiwi
fruit: orange
fruit: pear

useless stuff

The idea is to catch all the fruit names, in the order that they appear, and by groups. With the above example, output would have to be something like:

[['apple', 'banana'], ['kiwi', 'orange', 'pear']]

I managed to do this by iterating over all the matches of the multiline regexp '^fruit: (.+)$' and appending each fruit name to the current list whenever its line directly follows the line of the previous match.
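For reference, the iteration described above might look roughly like the following sketch. The variable names and the adjacency test (a match starting right after the newline that ended the previous one) are assumptions for illustration, not the exact code I used.

```python
import re

# Sample text mirroring the file above (assumed layout).
text = (
    "useless stuff\n\n"
    "fruit: apple\nfruit: banana\n\n"
    "useless stuff\n\n"
    "fruit: kiwi\nfruit: orange\nfruit: pear\n\n"
    "useless stuff\n"
)

groups = []
previous_end = None
for m in re.finditer(r'^fruit: (.+)$', text, re.M):
    # A match continues the current group when it starts right after the
    # newline that ended the previous match.
    if previous_end is not None and m.start() == previous_end + 1:
        groups[-1].append(m.group(1))
    else:
        groups.append([m.group(1)])
    previous_end = m.end()

print(groups)
```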

However, this is impractical for doing substitutions on the fruit names, since I would then have to keep track of each match's start and end index, so I would prefer to do this with a single regexp.

I've tried this:

re.findall(r'(?:^fruit: (.+)$\n)+', thetext, re.M)

But it only returns one line.

Where am I going wrong?

Upvotes: 2

Views: 175

Answers (6)

georg

Reputation: 215059

You cannot do "grouping" this way in regular expressions, because normally a group captures only its latest match. A workaround would be to repeat a group literally:

matches = re.findall(r'(?m)(?:^fruit: (.+)\n)(?:^fruit: (.+)\n)?(?:^fruit: (.+)\n)?', text)
# [('apple', 'banana', ''), ('kiwi', 'orange', 'pear')]

If this is appropriate to your task (say, no more than 5-6 groups), you can easily generate such expressions on the fly. If not, the only option is a two-pass match (I guess this is similar to what you already have):

matches = [re.findall(': (.+)', x) 
    for x in re.findall(r'(?m)((?:^fruit: .+\n)+)', text)]
# [['apple', 'banana'], ['kiwi', 'orange', 'pear']]
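For the repeated-group workaround, generating the expression on the fly could look like this sketch. The helper name build_pattern and the limit of 5 lines per group are assumptions made for illustration; a run longer than the limit would be split across several matches.

```python
import re

# Sample text mirroring the question's file (assumed layout).
text = (
    "useless stuff\n\n"
    "fruit: apple\nfruit: banana\n\n"
    "useless stuff\n\n"
    "fruit: kiwi\nfruit: orange\nfruit: pear\n\n"
    "useless stuff\n"
)

def build_pattern(max_lines):
    """Build a pattern with one mandatory and max_lines - 1 optional fruit lines."""
    unit = r'(?:^fruit: (.+)\n)'
    return '(?m)' + unit + (unit + '?') * (max_lines - 1)

pattern = build_pattern(5)
# findall yields '' for optional groups that did not take part in a match,
# so filter those out of each tuple.
groups = [[name for name in m if name] for m in re.findall(pattern, text)]
print(groups)
```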

The third-party regex module (not yet part of the standard library) provides an interesting method called "captures". m.captures(n) returns all matches for a group, not only the latest one as m.group(n) does:

import regex
matches = [x.captures(2) for x in regex.finditer(r'(?m)((?:^fruit: (.+)\n)+)', text)]
# [['apple', 'banana'], ['kiwi', 'orange', 'pear']]
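Since the question mentions substitutions, the two-pass idea can probably be applied with re.sub as well, so no start and end indexes need to be tracked by hand. This is only a sketch: the rewrite rule (tagging each name with its position inside its group) and the function names are made up for illustration.

```python
import itertools
import re

# Sample text mirroring the question's file (assumed layout).
text = (
    "useless stuff\n\n"
    "fruit: apple\nfruit: banana\n\n"
    "useless stuff\n\n"
    "fruit: kiwi\nfruit: orange\nfruit: pear\n\n"
    "useless stuff\n"
)

def rewrite_block(block_match):
    # Second pass: rewrite every fruit line inside one matched block.
    counter = itertools.count(1)

    def rewrite_line(m):
        # Hypothetical rule: append the name's position within its group.
        return '{}{}#{}'.format(m.group(1), m.group(2), next(counter))

    return re.sub(r'(?m)^(fruit: )(.+)$', rewrite_line, block_match.group(0))

# First pass: each run of consecutive "fruit:" lines is matched as one block.
result = re.sub(r'(?m)(?:^fruit: .+\n)+', rewrite_block, text)
print(result)
```

The outer call decides where a group starts and ends, and the inner call only ever sees the lines of one group, which is why the counter restarts for each group.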

Upvotes: 1

perreal

Reputation: 98118

Another way:

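import re

with open('input') as file:
    lines = file.read()

fruits = [[]]
# Each match is either a fruit name or '' (for a blank-line separator).
for fruit in re.findall(r'(?:fruit: ([^\n]*))|(?:\n\n)', lines):
    if fruit == '':
        # A separator closes the current group if it holds any fruit.
        if fruits[-1]:
            fruits.append([])
    else:
        fruits[-1].append(fruit)
# Drop the trailing group only when it is empty.
if not fruits[-1]:
    del fruits[-1]
print(fruits)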

Output

[['apple', 'banana'], ['kiwi', 'orange', 'pear']]

Upvotes: 1

jamylak

Reputation: 133764

This allows you to keep your regex, as you said you may need more complex expressions later:

>>> import re
>>> from itertools import groupby
>>> with open('test.txt') as fin:
        groups = groupby((re.match(r'(?:fruit: )(.+)', line) for line in fin),
                         key=bool) # groups based on whether each line matched
        print([[m.group(1) for m in g] for k, g in groups if k])
        # prints each matching group


[['apple', 'banana'], ['kiwi', 'orange', 'pear']]

Without regex:

>>> with open('test.txt') as f:
        print([[x.split()[1] for x in g]
               for k, g in groupby(f, key=lambda s: s.startswith('fruit'))
               if k])


[['apple', 'banana'], ['kiwi', 'orange', 'pear']]

Upvotes: 1

user2264587

Reputation: 474

How about:

re.findall(r'fruit: (\w+)\n|[^\n]*\n', text)

The result:

['', '', 'apple', 'banana', '', '', '', 'kiwi', 'orange', 'pear', '']

This can easily be converted to [['apple', 'banana'], ['kiwi', 'orange', 'pear']].
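One possible way to do that conversion, sketched with itertools.groupby (the variable names are illustrative): runs of non-empty strings become groups, and runs of empty strings are discarded.

```python
from itertools import groupby

flat = ['', '', 'apple', 'banana', '', '', '', 'kiwi', 'orange', 'pear', '']

# key=bool splits the list into alternating runs of empty and non-empty
# strings; only the non-empty runs are kept.
groups = [list(run) for non_empty, run in groupby(flat, key=bool) if non_empty]
print(groups)
```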


Upvotes: 0

allyourcode

Reputation: 22623

I think you will see the problem if you make the inner group non-capturing like so:

re.findall(r'(?:^fruit: (?:.+)$\n)+', thetext, re.M)
# result:
['fruit: apple\nfruit: banana\n', 'fruit: kiwi\nfruit: orange\nfruit: pear\n']

The problem is that each match covers an entire run of fruit: lines, so the capturing group in your original solution captures several times per match. Since a capture group can hold only one value, it ends up with the last captured substring. This is documented behavior in Python's re module: when a group matches multiple times, only the last match is accessible.
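To see the effect in isolation, here is a small sketch. The sample text is assumed to mirror the question's file, and the second pattern is a made-up minimal case.

```python
import re

# Sample text mirroring the question's file (assumed layout).
text = (
    "useless stuff\n\n"
    "fruit: apple\nfruit: banana\n\n"
    "useless stuff\n\n"
    "fruit: kiwi\nfruit: orange\nfruit: pear\n\n"
    "useless stuff\n"
)

# The question's pattern: the capturing group sits inside a repeated group,
# so each match reports only the final fruit of its run.
last_only = re.findall(r'(?:^fruit: (.+)$\n)+', text, re.M)
print(last_only)   # ['banana', 'pear']

# A smaller case showing the same thing on a single match object.
m = re.match(r'(?:(\w)-)+', 'a-b-c-')
print(m.group(0))  # a-b-c-  (the whole match spans every repetition)
print(m.group(1))  # c       (the group keeps only its last repetition)
```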

Upvotes: 1

JosefAssad

Reputation: 4128

I'm not a big fan of using regex unless you absolutely have to. Taking a step back and looking at your case, my first inclination is to ask whether you shouldn't instead massage the input files into something like CSV with a specialised tool like awk before feeding them into Python.

Having said that, you can of course still accomplish what you're looking to do using clear, regex-free Python. An example (which I'm sure can be reduced without sacrificing transparency):

# result is the end result list of lists
result = []
# lst is the sublist which gets reset every time a grouping concludes
lst = []

with open('input.txt') as f:
    for line in f:
        # is the first token NOT a fruit?
        if line.split(':')[0] != 'fruit':
            # close the current group, skipping needless empty sublists
            if lst:
                result.append(lst)
            # initialise a new sublist, since this line isn't a fruit and
            # this implies a new group is starting
            lst = []
        else:
            # first token IS a fruit, so append its name to the sublist
            lst.append(line.split()[1])

# don't lose the last group if the file ends with fruit lines
if lst:
    result.append(lst)

print(result)

Upvotes: 0
