sfortney

Reputation: 2123

Breaking up text document into sections with Python using Regex match on section titles

My documents have sections that are clearly marked by titles. I want to split each document into sections using these titles. Example:

1.1 Lorem Ipsum

Blah blah blah
9 (page break, never will have a period in it though)
Bleh bleh bleh as referenced in Section 1.3 hey hey hey

1.2 Lorem Ipsumus

Blah blah blah

I would like a regex that captures each title together with the text up to the next title. So the desired results for the example would be:

1.1 Lorem Ipsum Blah blah blah bleh bleh bleh as referenced in Section 1.3 hey hey hey

And

1.2 Lorem Ipsumus Blah blah blah

One thing I can always count on is that a section title starts a new line with some sort of number x.x followed by a few words. Since that is pretty unique to the titles, that is what I would like to search on.

Basically if I see anything that is a new line and of the form "Section 1.2 Definitions" I know that is a new section and would like to grab all the text from there until the next new line that starts with "Section 1.3 Examples" or perhaps "Section 2.1 Terms". Section titles always start a new line and are of the form "Section 1.3 Examples", "Article 1.3 Examples", or "1.3 Examples".

Sometimes there are references to titles in the middle of a line and these I would like to ignore. This can be seen in the example.

Does anyone know how to do this? Preferably in Python, but the regex alone should be sufficient if not.

p.s. Keeping the page numbers or not is optional but the regex would ideally not create new sections based on the page numbers


EDIT: So far, here is the MWE I have running. It's not quite there.

import re
doc_splitter = re.compile(r"(?<=\n)(?P<secname>[\w]+ )(\d+\.\d+ .*?)(?<=\n)(?P<secname2>[\w]+ )(?=\d+\.\d+|\Z)", re.DOTALL)

text = """

Section 1.1 Lorem Ipsum

Blah blah blah
9
Bleh bleh bleh Section 1.1 hey hey hey

Section 1.2 Lorem Ipsumus 
ref Section 1.3

Blah blah blah

Section 1.3 hey hey

Section 1.4

"""


for match in doc_splitter.finditer(text):
    print([match.group()])

Ideally it would return:

['Section 1.1 Lorem Ipsum Blah blah blah 9 Bleh bleh bleh Section 1.1 hey hey hey']
['Section 1.2 Lorem Ipsumus ref Section 1.3 Blah blah blah']
['Section 1.3 hey hey']
['Section 1.4']

But instead it returns:

['Section 1.1 Lorem Ipsum\n\nBlah blah blah\n9\nBleh bleh bleh Section 1.1 hey hey hey\n\nSection ']
['Section 1.3 hey hey\n\nSection ']

Thanks for all the help everyone! If anyone has any thoughts on how to get this last problem fixed it would be very appreciated.

Upvotes: 1

Views: 2985

Answers (3)

Arne

Reputation: 20147

The regex you are looking for might be similar to this:

import re

doc_splitter = re.compile(r"(?<=\n)(\d+\.\d+ .*?)(?<=\n)(?=\d+\.\d+|$)", re.DOTALL)

In Python, it can be run on the whole document with finditer:

text = """
1.1 Lorem Ipsum

Blah blah blah
9 (page break, never will have a period in it though)
Bleh bleh bleh

1.2 Lorem Ipsumus

Blah blah blah
"""  # the text must end with a newline: the pattern only closes a section right after a "\n"
for match in doc_splitter.finditer(text):
    print([match.group()])  # print in list to suppress \n interpretation 

Prints:

['1.1 Lorem Ipsum\n\nBlah blah blah\n9 (page break, never will have a period in it though)\nBleh bleh bleh\n\n']
['1.2 Lorem Ipsumus\n\nBlah blah blah\n']

which seems to be what you want.

If you iterate the data differently you might be able to get rid of the cumbersome lookaround assertions, which might not cleanly translate into other languages that demand constant length lookarounds. The core is given with (\d+\.\d+ .*?) and forcing a full match.


Alternative

Jan's answer is good, but I also wanted to add a solution that solves the problem without lookahead conditions, since they look redundant:

import re
doc_splitter = re.compile(r"^(?:Section\ )?\d+\.\d+", re.MULTILINE)
text = """

Section 1.1 Lorem Ipsum

Blah blah blah
9
Bleh bleh bleh Section 1.1 hey hey hey

Section 1.2 Lorem Ipsumus 
ref Section 1.3

Blah blah blah

Section 1.3 hey hey

Section 1.4

"""
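# Offsets where each section begins, plus the end of the text as a final boundary.
starts = [match.start() for match in doc_splitter.finditer(text)] + [len(text)]
# Each section runs from one start offset to the next.
sections = [text[begin:end] for begin, end in zip(starts, starts[1:])]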
for section in sections:
    print([section])

Prints:

['Section 1.1 Lorem Ipsum\n\nBlah blah blah\n9\nBleh bleh bleh Section 1.1 hey hey hey\n\n']
['Section 1.2 Lorem Ipsumus \nref Section 1.3\n\nBlah blah blah\n\n']
['Section 1.3 hey hey\n\n']
['Section 1.4\n\n']

The regex only searches for the start of a new section, and should be easy enough to maintain and extend. We have to go through the additional step of splitting the text by hand from each new start, which serves as the ending for the former section.

While a regex is perfectly capable of handling this kind of matching in a single step, I personally prefer to keep them as short as possible. They are difficult enough to understand already.
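
As a rough sketch of how the start-only pattern could be extended, the version below also accepts the "Article" prefix mentioned in the question. The helper name split_sections and the sample text are made up for illustration, and dropping any text before the first title is an assumption about what you want.

```python
import re

# Assumption: a title starts a line with an optional "Section " or "Article "
# prefix followed by a number of the form x.x, as described in the question.
TITLE_START = re.compile(r"^(?:(?:Section|Article) )?\d+\.\d+", re.MULTILINE)


def split_sections(text):
    """Slice text at every title start (hypothetical helper).

    Anything before the first title is dropped.
    """
    starts = [m.start() for m in TITLE_START.finditer(text)] + [len(text)]
    return [text[begin:end] for begin, end in zip(starts, starts[1:])]


sample = (
    "Section 1.1 Lorem Ipsum\n\nBlah blah\n9\nsee Section 1.3 here\n\n"
    "Article 1.2 Terms\n\nBlah\n\n"
    "1.3 Examples\n\nBleh\n"
)

sections = split_sections(sample)
for section in sections:
    print([section])
```

The page-number line "9" and the mid-line reference "see Section 1.3 here" do not start new sections, because neither is a line that begins with an x.x number.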

Upvotes: 1

Jan

Reputation: 43169

Just to put my two cents in - you could use

^
(?:Section\ )?\d+\.\d+
[\s\S]*?
(?=^(?:Section\ )?\d+\.\d+|\Z)

with the verbose and multiline modifiers; see a demo on regex101.com.


In Python:

import re

data = """
1.1 Lorem Ipsum

Blah blah blah
9 (page break, never will have a period in it though)
Bleh bleh bleh as referenced in Section 1.3 hey hey hey

1.2 Lorem Ipsumus

Blah blah blah
"""

rx = re.compile(r'''
    ^
    (?:Section\ )?\d+\.\d+
    [\s\S]*?
    (?=^(?:Section\ )?\d+\.\d+|\Z)

    ''', re.VERBOSE | re.MULTILINE)

parts = [match.group(0) for match in rx.finditer(data)]
print(parts)
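
If you want each section flattened onto one line, as in the desired output of the question, one possible follow-up is to collapse every run of whitespace into a single space. This is only a sketch: the variable name flat and the shortened sample text are my own, and a page-number line, if present, would simply stay in the flattened text.

```python
import re

# Same pattern as above: a title at the start of a line, then everything
# up to the next title at the start of a line or the end of the text.
rx = re.compile(r'''
    ^
    (?:Section\ )?\d+\.\d+
    [\s\S]*?
    (?=^(?:Section\ )?\d+\.\d+|\Z)
    ''', re.VERBOSE | re.MULTILINE)

data = """
1.1 Lorem Ipsum

Blah blah blah
Bleh bleh bleh as referenced in Section 1.3 hey hey hey

1.2 Lorem Ipsumus

Blah blah blah
"""

# str.split() with no arguments splits on any run of whitespace, so joining
# the pieces with one space flattens a section onto a single line.
flat = [" ".join(match.group(0).split()) for match in rx.finditer(data)]
print(flat)
```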

Upvotes: 3

izxle

Reputation: 405

I suggest you try regex101.com; it will help you visualize your regex. Also, the documentation for re is very useful to learn (or remember) how the special characters work.

With your example I'd use this regex (with named groups):

(?P<section_number>\d\.\d) (?P<section_title>[\w ]+)\n\n\s*(?P<body>.+?)\s*(?=\d\.\d[\w ]+|$)

Breaking it down:

For the section number and title I used named groups (?P<section_number>\d\.\d) and (?P<section_title>[\w ]+) separated by a space.

The body (?P<body>.+?) is followed by the positive lookahead (?=\d\.\d[\w ]+|$). This means that it will stop capturing text when another section is about to begin or when the document ends. It needs to be non-greedy (+?) or you'll end up with just one section and the rest of the document as the body.

NOTE: you need to enable re.DOTALL when you compile or search for matches, or the dot will not match newline characters.

If you want the next section title to count only at the beginning of a line, you can also add a ^ to the lookahead, but you need to enable re.MULTILINE. You'd also have to change the $ at the end to \Z so it matches only the end of the document and not the end of every line.

(?P<section_number>\d\.\d) (?P<section_title>[\w ]+)\n\n\s*(?P<body>.+?)\s*(?=^\d\.\d[\w ]+|\Z)
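
A minimal sketch of how this second pattern might be used in Python, with both flags enabled. The sample text and the names pattern and records are just for illustration. Note that \d\.\d as written only covers single-digit numbers such as 1.3; something like \d+\.\d+ would be needed for 10.12.

```python
import re

# Both flags are required: DOTALL lets "." cross newlines inside the body,
# MULTILINE lets "^" in the lookahead match at the start of any line.
pattern = re.compile(
    r"(?P<section_number>\d\.\d) (?P<section_title>[\w ]+)\n\n\s*"
    r"(?P<body>.+?)\s*(?=^\d\.\d[\w ]+|\Z)",
    re.DOTALL | re.MULTILINE,
)

text = """1.1 Lorem Ipsum

Blah blah blah
Bleh bleh bleh as referenced in Section 1.3 hey hey hey

1.2 Lorem Ipsumus

Blah blah blah
"""

# groupdict() returns the named groups of each match as a plain dict.
records = [match.groupdict() for match in pattern.finditer(text)]
for record in records:
    print(record)
```

The mid-line reference to Section 1.3 stays inside the body of 1.1, because the ^ in the lookahead only accepts a section number at the start of a line.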

Upvotes: 1
