Turtles Are Cute

Reputation: 3426

Split text into sections using python regex

I have a large, multi-line string with multiple entries following a similar format. I'd like to split it into a list of strings for each entry.

I tried the following:

myre = re.compile(r'Record\sTime.*-{5}', re.DOTALL)
return re.findall(myre, text)

In this case, entries start with 'Record Time' and end with '-----'. Instead of behaving the way I'd like, the code above returns a single item, starting at the beginning of the first entry and ending at the end of the last one.

I could probably make this work by using a regex to find the end of a segment, then repeating the search on a slice of the original text starting there, but that seems messy.

Upvotes: 2

Views: 2086

Answers (3)

dawg

Reputation: 104092

Something like this:

txt='''\
Record Time
1
2
3
-----

Record Time
4
5
-----
Record Time
6
7
8
'''

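import re
# re.M lets ^ and $ match at line boundaries; re.S lets . match newlines.
# A block ends at a line of five hyphens or at the end of the text.
pat = re.compile(r'^Record Time$(.*?)(?:^-{5}|\Z)', re.S | re.M)
for i, block in enumerate(m.group(1) for m in pat.finditer(txt)):
    print('block:', i)
    print(block.strip())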

Prints:

block: 0
1
2
3
block: 1
4
5
block: 2
6
7
8

Upvotes: 1

Casimir et Hippolyte

Reputation: 89629

You can use the following to avoid a reluctant quantifier. It relies on a trick that emulates an atomic group: (?=(...))\1. It's slightly off topic, but it can be useful:

myre = re.compile(r'Record\sTime(?:(?=([^-]+|-(?!-{4})))\1)+-{5}')
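
As a rough sketch of how that pattern behaves, here it is run against a made-up two-record sample. The `text` value and the `atomic` and `entries` names are assumptions for illustration, not part of the question. Note that the pattern needs a raw string, otherwise Python reads `\1` as a control character instead of a backreference.

```python
import re

# Hypothetical sample text, for illustration only. The first record
# contains a lone hyphen to show that single hyphens are allowed.
text = (
    "Record Time\na-b\n2\n-----\n"
    "Record Time\n3\n4\n-----\n"
)

# (?=(...))\1 : the lookahead captures a chunk, then \1 consumes exactly
# that chunk. Python does not backtrack into a lookahead once it has
# succeeded, so each chunk behaves like an atomic group.
# A chunk is either a run of non-hyphens, or one hyphen that is not
# followed by four more hyphens.
atomic = re.compile(r'Record\sTime(?:(?=([^-]+|-(?!-{4})))\1)+-{5}')

# The pattern contains a capture group, so findall() would return the
# captured chunks; finditer() with group(0) returns the whole entries.
entries = [m.group(0) for m in atomic.finditer(text)]

print(len(entries))  # 2
```

An entry that is missing its closing '-----' will not match this pattern at all, so it only suits input where every record is terminated.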

Upvotes: 1

NPE

Reputation: 500883

You need to turn the .* into a reluctant match, by adding a question mark:

.*?

Otherwise it matches as much as it can, from just after the first 'Record Time' to just before the last '-----'.
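
A minimal sketch of the difference, assuming a made-up two-record sample (the `text` value and the `greedy` and `reluctant` names are illustrative, not from the question):

```python
import re

# Hypothetical sample text with two records, for illustration only.
text = (
    "Record Time\n1\n2\n-----\n"
    "Record Time\n3\n4\n-----\n"
)

# Greedy: .* runs on to the last '-----', so both records end up in one match.
greedy = re.compile(r'Record\sTime.*-{5}', re.DOTALL)

# Reluctant: .*? stops at the first '-----' it reaches, one match per record.
reluctant = re.compile(r'Record\sTime.*?-{5}', re.DOTALL)

greedy_matches = greedy.findall(text)
reluctant_matches = reluctant.findall(text)

print(len(greedy_matches))     # 1
print(len(reluctant_matches))  # 2
```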

See Greedy vs. Reluctant vs. Possessive Quantifiers

Upvotes: 5
