Automatically generate nested table of contents based on heading tags using Python

Question

I'm trying to create a nested table of contents based on the heading tags of an HTML file.

My HTML file:



  


  
            My report Name
  
  First Chapter                          
   First Sub-chapter of the first chapter
  
    Useless h1
    
      some text
    
  
  Second Sub-chapter of the first chapter 
  
    Useless h1
    
      some text
    
  
  Second Chapter                          
  First Sub-chapter of the Second chapter 
  
    Useless h1
    
      some text
    
  
  Second Sub-chapter of the Second chapter 
  
    Useless h1
    
      some text

My Python code:

from lxml import html
from bs4 import BeautifulSoup as soup
import re
import codecs
# Access the local HTML file (raw string so the backslashes are not treated as escapes)
f = codecs.open(r"C:\x\test.html", 'r')
page = f.read()
f.close()
#html parsing
page_soup = soup(page,"html.parser")
tree = html.fromstring(page)
# Extract the report name
ref = page_soup.find("h1",{"id": False}).text.strip()
print("the name of the report is : " + ref + " \n")

chapters = page_soup.findAll('h1', attrs={'id': re.compile("^[0-9]*$")})
print("We have " + str(len(chapters)) + " chapter(s)")
for index, chapter in enumerate(chapters):
    print(str(index+1) + "-" + str(chapter.text.strip()) + "\n")

sub_chapters = page_soup.findAll('h2', attrs={'id': re.compile("^[0-9]*$")})
print("We have " + str(len(sub_chapters)) + " sub_chapter(s)")
for index, sub_chapter in enumerate(sub_chapters):
    print(str(index+1) + "-" + str(sub_chapter.text.strip()) + "\n")

With this code, I am able to get all the chapters and all the sub-chapters as two separate flat lists, but that is not what I want.

My goal is to get the below as my table of contents:

1-First Chapter
    1-First sub-chapter of the first chapter
    2-Second sub-chapter of the first chapter
2-Second Chapter    
    1-First sub-chapter of the Second chapter
    2-Second sub-chapter of the Second chapter

Any recommendations or ideas on how to achieve my desired table of contents format?

Ajax1234 · Accepted Answer

You can use itertools.groupby after finding all the data associated with each chapter:

from itertools import groupby, count
import re
from bs4 import BeautifulSoup as soup
# `content` is assumed to hold the HTML document as a string
data = [[i.name, re.sub(r'\s+$', '', i.text)] for i in soup(content, 'html.parser').find_all(re.compile('h1|h2'), {'id': re.compile(r'^\d+$')})]
grouped, _count = [[a, list(b)] for a, b in groupby(data, key=lambda x:x[0] == 'h1')], count(1)
new_grouped = [[grouped[i][-1][0][-1], [c for _, c in grouped[i+1][-1]]] for i in range(0, len(grouped), 2)]
final_string = '\n'.join(f'{next(_count)}-{a}\n' + '\n'.join(f'\t{i}-{c}' for i, c in enumerate(b, 1)) for a, b in new_grouped)
print(final_string)

Output:

1-First Chapter
    1- First Sub-chapter of the first chapter
    2-Second Sub-chapter of the first chapter
2-Second Chapter
    1-First Sub-chapter of the Second chapter
    2-Second Sub-chapter of the Second chapter
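
The groupby version relies on every chapter heading being followed by at least one sub-chapter: a final h1 with no h2 after it would make grouped[i+1] raise an IndexError, and two consecutive h1 headings would collapse into one entry. As a rough alternative, a single pass over the same [tag, text] pairs with two counters avoids that assumption. The function name build_toc and the sample list below are hypothetical, used only for illustration; in practice the input would be the data list built above.

```python
def build_toc(headings):
    """Turn a flat sequence of (tag, title) pairs into a nested, numbered TOC.

    `headings` is assumed to be in document order, like the `data` list above.
    """
    lines = []
    chapter_no = 0
    sub_no = 0
    for tag, title in headings:
        title = title.strip()
        if tag == 'h1':
            chapter_no += 1
            sub_no = 0  # restart sub-chapter numbering for each new chapter
            lines.append(f'{chapter_no}-{title}')
        elif tag == 'h2':
            sub_no += 1
            lines.append(f'\t{sub_no}-{title}')
    return '\n'.join(lines)


# Hypothetical sample standing in for `data`; the last chapter has no sub-chapters.
sample = [
    ['h1', 'First Chapter'],
    ['h2', ' First Sub-chapter of the first chapter'],
    ['h2', 'Second Sub-chapter of the first chapter'],
    ['h1', 'Second Chapter'],
    ['h2', 'First Sub-chapter of the Second chapter'],
    ['h1', 'Third Chapter'],
]
toc = build_toc(sample)
print(toc)
```

Because each title is passed through strip(), the stray leading space seen in the output above also disappears. Calling build_toc(data) with the list from the answer should give the same nesting for the question's HTML.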
