p192

Reputation: 528

Using Regular Expressions to parse Table of Contents

I am trying to parse a table of contents using regular expressions. Typically the extracted lines look as follows:

4.1.2 Administrative ............................................................................................................... I-i

Sometimes the leader consists of spaces instead of dots. What I need to parse are the section number (in this case 4.1.2), the section name (Administrative), and the page number (in this case I-i).

For the section number so far I have:

.*(?=[.]{3,}|[ ]{3,})

But of course this returns 4.1.2 Administrative. First, I need some sort of lookbehind assertion that looks for the following possible combinations:

  1. X, X., X.X, X.X., X.X.X, etc., where X represents a digit

From there I would get the section name, and then I'd need another RE to get the section numbers X, X., X.X, X.X., etc. Finally I would need one RE to get the page number. I thought of doing something like:

(?<=[.]{3,}|[ ]{3,}).*

But of course the above RE would not work, since look-behinds must be fixed-width (at least in Python's re module) and so cannot contain quantifiers such as {3,}.

What I want to end up with is

sectionNumber = "4.1.2"
sectionName = "Administrative"
pageNumber = "I-i"

Any help with any of these would be appreciated.

Upvotes: 2

Views: 556

Answers (1)

Kasravnd

Reputation: 107357

Here is one possible regex that you can use with re.findall and the case-insensitive flag (re.I):

r'^([\d.]{1,5})\s([a-z]+)\s\.{10,}\s([\w-]+)'

Demo:

In [39]: re.findall(r'^([\d.]{1,5})\s([a-z]+)\s\.{10,}\s([\w-]+)', s, re.I)
Out[39]: [('4.1.2', 'Administrative', 'I-i')]

Note-1: The middle dots are matched by \.{10,}; you can change the repetition count to something more suitable if you like. The trailing page number is matched by [\w-]+, which accepts any combination of word characters (letters, digits, and the underscore) and -. I'd suggest replacing \w with the specific set of characters that you're sure can appear in that part, plus -, inside the character class.
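As a rough sketch of that suggestion, the variant below narrows the page-number class and also accepts a run of spaces as the leader, since the question mentions both. The minimum leader length of three, the Roman-numeral-plus-digits character set, and the helper name parse_toc_line are assumptions made for illustration, so adjust them to your data:

```python
import re

# Assumptions (not from the original pattern): the leader is a run of at
# least three dots or at least three whitespace characters, and page labels
# contain only digits, Roman-numeral letters, and hyphens.
TOC_LINE = re.compile(
    r'^([\d.]+)\s+([a-z]+)\s*(?:\.{3,}|\s{3,})\s*([ivxlcdm\d-]+)$',
    re.IGNORECASE,
)

def parse_toc_line(line):
    """Return (section_number, section_name, page_number), or None if no match."""
    match = TOC_LINE.match(line.strip())
    return match.groups() if match else None

# One sample line with a dot leader and one with a space leader.
dotted = '4.1.2 Administrative ' + '.' * 40 + ' I-i'
spaced = '4.1.2 Administrative' + ' ' * 40 + 'I-i'

print(parse_toc_line(dotted))
print(parse_toc_line(spaced))
```

Because the leader is matched by a non-capturing group, no look-behind is needed: the three capturing groups already give you the section number, the section name, and the page number.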

Note-2: Regarding ([\d.]{1,5}) for matching the section number: if you think it's possible to have multi-digit numbers in it, I'd suggest using ([\d.]+) or raising the upper bound, e.g. ([\d.]{1,8}).
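Keep in mind that a character class like [\d.] says nothing about the order of digits and dots, so it would also accept strings such as ... or 4..1. If you only want the forms listed in the question (X, X., X.X, X.X., and so on), a stricter sub-pattern is one option. This is a minimal sketch, and the name is_section_number is just for illustration:

```python
import re

# One or more digit groups separated by single dots, with an optional
# trailing dot: matches 4, 4., 4.1, 4.1., 4.1.2, 10.12.3., ...
SECTION_NUMBER = re.compile(r'\d+(?:\.\d+)*\.?')

def is_section_number(text):
    """True if the whole string has the form X, X., X.X, X.X., etc."""
    return SECTION_NUMBER.fullmatch(text) is not None

for candidate in ['4', '4.', '4.1.2', '10.12.3.', '...', '4..1', '.4']:
    print(candidate, is_section_number(candidate))
```

You could drop \d+(?:\.\d+)*\.? into the first capturing group of the main pattern in place of [\d.]{1,5}.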

Upvotes: 2
