Reputation: 2123
My documents have sections that are clearly marked by titles. I want to split each document into sections using these titles. Example:
1.1 Lorem Ipsum
Blah blah blah
9 (page break, never will have a period in it though)
Bleh bleh bleh as referenced in Section 1.3 hey hey hey
1.2 Lorem Ipsumus
Blah blah blah
I would like a regular expression that captures each title and the text up to the next title. So the desired results for the example would be:
1.1 Lorem Ipsum Blah blah blah bleh bleh bleh as referenced in Section 1.3 hey hey hey
And
1.2 Lorem Ipsumus Blah blah blah
One thing I can always count on is that each section title starts a new line with some sort of number x.x followed by a few words. Since that is pretty unique to the titles, that is what I would like to search on.
Basically if I see anything that is a new line and of the form "Section 1.2 Definitions" I know that is a new section and would like to grab all the text from there until the next new line that starts with "Section 1.3 Examples" or perhaps "Section 2.1 Terms". Section titles always start a new line and are of the form "Section 1.3 Examples", "Article 1.3 Examples", or "1.3 Examples".
Sometimes there are references to titles in the middle of a line and these I would like to ignore. This can be seen in the example.
Does anyone know how to do this? Preferably in Python, but the regex alone would be sufficient.
P.S. Keeping the page numbers is optional, but the regex should ideally not create new sections based on the page numbers.
EDIT: So far, here is the MWE I have running. It's not quite there.
import re
doc_splitter = re.compile(r"(?<=\n)(?P<secname>[\w]+ )(\d+\.\d+ .*?)(?<=\n)(?P<secname2>[\w]+ )(?=\d+\.\d+|\Z)", re.DOTALL)
text = """
Section 1.1 Lorem Ipsum
Blah blah blah
9
Bleh bleh bleh Section 1.1 hey hey hey
Section 1.2 Lorem Ipsumus
ref Section 1.3
Blah blah blah
Section 1.3 hey hey
Section 1.4
"""
for match in doc_splitter.finditer(text):
print([match.group()])
Ideally it would return:
['Section 1.1 Lorem Ipsum Blah blah blah 9 Bleh bleh bleh Section 1.1 hey hey hey']
['Section 1.2 Lorem Ipsumus ref Section 1.3 Blah blah blah']
['Section 1.3 hey hey']
['Section 1.4']
But instead it returns:
['Section 1.1 Lorem Ipsum\n\nBlah blah blah\n9\nBleh bleh bleh Section 1.1 hey hey hey\n\nSection ']
['Section 1.3 hey hey\n\nSection ']
Thanks for all the help everyone! If anyone has any thoughts on how to get this last problem fixed it would be very appreciated.
Upvotes: 1
Views: 2985
Reputation: 20147
The regex you are looking for might be similar to this:
doc_splitter = re.compile(r"(?<=\n)(\d+\.\d+ .*?)(?<=\n)(?=\d+\.\d+|$)", re.DOTALL)
which, in Python, can be run on the whole document with finditer:
text = """
1.1 Lorem Ipsum
Blah blah blah
9 (page break, never will have a period in it though)
Bleh bleh bleh
1.2 Lorem Ipsumus
Blah blah blah"""
for match in doc_splitter.finditer(text):
print([match.group()]) # print in list to suppress \n interpretation
Prints:
['1.1 Lorem Ipsum\n\nBlah blah blah\n9 (page break, never will have a period in it though)\nBleh bleh bleh\n\n']
['1.2 Lorem Ipsumus\n\nBlah blah blah\n']
which seems to be what you want.
If you iterate over the data differently, you might be able to get rid of the cumbersome lookaround assertions, which might not translate cleanly into other languages that demand fixed-width lookarounds. The core is given by (\d+\.\d+ .*?) and forcing a full match.
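One possible reading of "iterating the data differently" is to walk the document line by line and test each line against the title core, so no lookarounds are needed at all. The sketch below is only an assumption about what that could look like; the names TITLE and split_sections are hypothetical, and the sample text assumes blank lines between paragraphs.

```python
import re

# Hypothetical title test: the core "x.x " anchored at the start of a line.
# re.match only matches at the beginning of the string it is given.
TITLE = re.compile(r"\d+\.\d+ ")

def split_sections(text):
    """Group lines into sections; a new section starts at every title line."""
    sections = []
    for line in text.splitlines():
        if TITLE.match(line):
            sections.append([line])      # a title opens a new section
        elif sections:
            sections[-1].append(line)    # any other line belongs to the current section
    return ["\n".join(lines) for lines in sections]

text = """
1.1 Lorem Ipsum

Blah blah blah
9 (page break, never will have a period in it though)
Bleh bleh bleh

1.2 Lorem Ipsumus

Blah blah blah
"""

sections = split_sections(text)
for section in sections:
    print([section])
```

Lines that appear before the first title are dropped in this sketch, and the page-number line stays inside its section because "9 " has no period after the first number.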
Jan's answer is good, but I also wanted to add a solution that solves the problem without lookahead conditions, since they look redundant:
import re
doc_splitter = re.compile(r"^(?:Section\ )?\d+\.\d+", re.MULTILINE)
text = """
Section 1.1 Lorem Ipsum
Blah blah blah
9
Bleh bleh bleh Section 1.1 hey hey hey
Section 1.2 Lorem Ipsumus
ref Section 1.3
Blah blah blah
Section 1.3 hey hey
Section 1.4
"""
starts = [match.span()[0] for match in doc_splitter.finditer(text)] + [len(text)]
sections = [text[starts[idx]:starts[idx+1]] for idx in range(len(starts)-1)]
for section in sections:
print([section])
Prints:
['Section 1.1 Lorem Ipsum\n\nBlah blah blah\n9\nBleh bleh bleh Section 1.1 hey hey hey\n\n']
['Section 1.2 Lorem Ipsumus \nref Section 1.3\n\nBlah blah blah\n\n']
['Section 1.3 hey hey\n\n']
['Section 1.4\n\n']
The regex only searches for the start of a new section, and should be easy enough to maintain and extend. We have to go through the additional step of splitting the text by hand at each new start, which also serves as the end of the previous section.
While a regex is perfectly capable of handling this kind of matching in a single step, I personally prefer to keep them as short as possible. They are difficult enough to understand already.
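If the single-line form from the question's "ideally it would return" list is wanted, the same start-offset approach can be followed by a whitespace cleanup. This is a minimal sketch, assuming that collapsing every run of whitespace into one space is acceptable; the blank lines in the sample text are also an assumption.

```python
import re

doc_splitter = re.compile(r"^(?:Section\ )?\d+\.\d+", re.MULTILINE)

text = """
Section 1.1 Lorem Ipsum

Blah blah blah
9
Bleh bleh bleh Section 1.1 hey hey hey

Section 1.2 Lorem Ipsumus
ref Section 1.3

Blah blah blah

Section 1.3 hey hey

Section 1.4
"""

# Offsets of every title, plus the end of the text as a final boundary.
starts = [m.start() for m in doc_splitter.finditer(text)] + [len(text)]

# Slice between consecutive offsets, then collapse newlines and repeated spaces.
sections = [" ".join(text[a:b].split()) for a, b in zip(starts, starts[1:])]

for section in sections:
    print([section])
```

Pairing starts with starts[1:] avoids the explicit index arithmetic, and str.split() with no arguments splits on any whitespace, including newlines.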
Upvotes: 1
Reputation: 43169
Just to put my two cents in - you could use
^
(?:Section\ )?\d+\.\d+
[\s\S]*?
(?=^(?:Section\ )?\d+\.\d+|\Z)
with the verbose and multiline modifiers; see a demo on regex101.com.
In Python:
import re
data = """
1.1 Lorem Ipsum
Blah blah blah
9 (page break, never will have a period in it though)
Bleh bleh bleh as referenced in Section 1.3 hey hey hey
1.2 Lorem Ipsumus
Blah blah blah
"""
rx = re.compile(r'''
^
(?:Section\ )?\d+\.\d+
[\s\S]*?
(?=^(?:Section\ )?\d+\.\d+|\Z)
''', re.VERBOSE | re.MULTILINE)
parts = [match.group(0) for match in rx.finditer(data)]
print(parts)
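The question also mentions titles of the form "Article 1.3 Examples". A possible extension of the pattern above, offered as a sketch rather than as part of the original answer, is to turn the optional prefix into an alternation. The sample data below is made up for illustration.

```python
import re

rx = re.compile(r'''
    ^(?:(?:Section|Article)\ )?\d+\.\d+          # title at the start of a line
    [\s\S]*?                                     # body, as short as possible
    (?=^(?:(?:Section|Article)\ )?\d+\.\d+|\Z)   # stop before the next title or at the end
''', re.VERBOSE | re.MULTILINE)

# Hypothetical sample mixing all three title forms and one mid-line reference.
data = """
Article 1.1 Lorem Ipsum
Blah blah blah as referenced in Section 1.3 hey hey hey
Section 1.2 Lorem Ipsumus
Blah blah blah
1.3 Examples
Bleh
"""

# strip() removes the trailing newline that each match carries.
parts = [match.group(0).strip() for match in rx.finditer(data)]
print(parts)
```

The mid-line "Section 1.3" does not start a new part because the lookahead requires the title to sit at the beginning of a line.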
Upvotes: 3
Reputation: 405
I suggest you try regex101.com; it will help you visualize your regex. Also, the documentation for re is very useful for learning (or remembering) how the special characters work.
With your example I'd use this regex (with named groups):
(?P<section_number>\d\.\d) (?P<section_title>[\w ]+)\n\n\s*(?P<body>.+?)\s*(?=\d\.\d[\w ]+|$)
Breaking it down:
For the section number and title I used the named groups (?P<section_number>\d\.\d) and (?P<section_title>[\w ]+), separated by a space.
The body (?P<body>.+?) is followed by the positive lookahead (?=\d\.\d[\w ]+|$). This means that it will stop capturing text when another section is about to begin or when the document ends. It needs to be non-greedy (+?) or you'll end up with just one section and the rest of the document as the body.
NOTE: you need to enable re.DOTALL when you compile or search for matches, or the dot will not match newline characters.
If you want the section title to match only at the beginning of a line, you can also add a ^ to the lookahead, but you need to enable re.MULTILINE. You'd also have to change the $ at the end to \Z so it matches only the end of the document and not the end of every line.
(?P<section_number>\d\.\d) (?P<section_title>[\w ]+)\n\n\s*(?P<body>.+?)\s*(?=^\d\.\d[\w ]+|\Z)
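As a usage sketch for this last pattern, the snippet below compiles it with both flags and collects the named groups per section. The variable names section_rx and records are hypothetical, and the sample assumes a blank line after every title, which the \n\n in the pattern requires. Note also that \d\.\d only covers single-digit numbers such as 1.2.

```python
import re

section_rx = re.compile(
    r"(?P<section_number>\d\.\d) (?P<section_title>[\w ]+)\n\n\s*"
    r"(?P<body>.+?)\s*(?=^\d\.\d[\w ]+|\Z)",
    re.DOTALL | re.MULTILINE,  # dot matches newlines; ^ matches at every line start
)

data = """1.1 Lorem Ipsum

Blah blah blah
9 (page break, never will have a period in it though)
Bleh bleh bleh as referenced in Section 1.3 hey hey hey

1.2 Lorem Ipsumus

Blah blah blah
"""

# One dict per section with the keys section_number, section_title and body.
records = [match.groupdict() for match in section_rx.finditer(data)]
for record in records:
    print(record)
```

Because of the ^ in the lookahead, the reference "Section 1.3" in the middle of a line does not end the first body early.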
Upvotes: 1