Reputation: 528
I am trying to parse a table of contents using regular expressions. Typically the extracted lines look as follows:
4.1.2 Administrative ............................................................................................................... I-i
The filler may also be spaces instead of dots. What I need to parse are the section number (in this case 4.1.2), the section name (Administrative), and the page number (in this case I-i).
For the section number so far I have:
.*(?=[.]{3,}|[ ]{3,})
But of course this returns 4.1.2 Administrative. First, I need some sort of lookbehind assertion that looks for the following possible combinations:
From there I would get the section name, and then I'd need another RE to get the section numbers X, X., X.X, X.X., etc. Finally I would need one RE to get the page number. I thought of doing something like:
(?<=[.]{3,}|[ ]{3,}).*
But of course the above RE would not work, since look-behinds (at least in Python's re module) must be fixed-width and so do not support quantifiers such as {3,}.
What I want to end up with is
sectionNumber = "4.1.2"
sectionName = "Administrative"
pageNumber = "I-i"
Any help with any of these would be appreciated.
Upvotes: 2
Views: 556
Reputation: 107357
Here is one possible regex that you can use with re.findall and a case-insensitive flag:
r'^([\d.]{1,5})\s([a-z]+)\s\.{10,}\s([\w-]+)'
Demo:
In [39]: re.findall(r'^([\d.]{1,5})\s([a-z]+)\s\.{10,}\s([\w-]+)', s, re.I)
Out[39]: [('4.1.2', 'Administrative', 'I-i')]
Note-1: The middle dots are matched by \.{10,}; you can change the repetition count to something more suitable if you like. The trailing page number is matched by [\w-]+, which accepts any combination of word characters (letters, digits, and underscore) and -. I'd suggest replacing \w with only the characters you're sure can appear in that part, plus -, inside the character class.
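As a rough sketch of Note-1, the pattern below narrows the page-number class and also accepts runs of spaces as filler, which the question mentions. The function name parse_toc_line, the threshold of three filler characters, and the assumption that page labels contain only digits, Roman-numeral letters, and hyphens are illustrative choices, not something given in the question:

```python
import re

# Hypothetical pattern; the group names mirror the variables in the question.
TOC_LINE = re.compile(
    r'^(?P<sectionNumber>[\d.]+)\s+'      # e.g. 4.1.2
    r'(?P<sectionName>.+?)\s*'            # lazy, so it stops before the filler
    r'(?:\.{3,}|\s{3,})\s*'               # filler: 3+ dots or 3+ whitespace chars
    r'(?P<pageNumber>[0-9IVXLCDM-]+)\s*$',  # assumed page alphabet
    re.IGNORECASE,
)

def parse_toc_line(line):
    """Return a dict with the three fields, or None if the line does not match."""
    match = TOC_LINE.match(line)
    return match.groupdict() if match else None

dotted = "4.1.2 Administrative " + "." * 40 + " I-i"
spaced = "4.1.2 Administrative" + " " * 40 + "I-i"
print(parse_toc_line(dotted))
print(parse_toc_line(spaced))
```

Because the section name is matched lazily with .+?, multi-word names such as "General Information" should also be captured, which [a-z]+ alone would not allow.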
Note-2: Regarding ([\d.]{1,5}) for matching the section number: if you think it's possible to have multi-digit numbers in it, I'd suggest using ([\d.]+), or ([\d.]{1,8}) with a higher upper bound.
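If the section number should follow the shapes listed in the question (X, X., X.X, X.X., and so on), a stricter alternative to [\d.]+ is possible. This is only a sketch; the helper name is_section_number is made up for illustration:

```python
import re

# One or more digit groups separated by single dots, with an optional trailing dot.
SECTION_NUMBER = re.compile(r'\d+(?:\.\d+)*\.?')

def is_section_number(text):
    """True if the whole string looks like X, X., X.X, X.X., etc."""
    return SECTION_NUMBER.fullmatch(text) is not None

for candidate in ["4", "4.", "4.1", "4.1.", "4.1.2", "12.10.3", "...", "4..1"]:
    print(candidate, is_section_number(candidate))
```

Unlike [\d.]+, this rejects strings made only of dots or containing consecutive dots, so a run of filler dots cannot be mistaken for a section number.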
Upvotes: 2