qed

Reputation: 23104

Sequentially grouping HTML content by tag

I have an HTML file similar to the following:

    <h2>section 1</h2>
    <p>para 1</p>
    <p>para 2</p>
    <p>para 3</p>
    <h2>section 2</h2>
    <p>para 1</p>
    <p>para 2</p>
    <p>para 3</p>
    <h2>section 3</h2>
    <p>para 1</p>
    <p>para 2</p>
    <p>para 3</p>

I would like to scrape those into a Python dictionary: {'section1':'...', 'section2':'...', 'section3':'...'}. Of course, I can set a current_section variable and use a while loop, but is there a module for this purpose? I have checked out BeautifulSoup but didn't find a shortcut there.

Thanks!

Upvotes: 2

Views: 458

Answers (2)

schesis

Reputation: 59128

As far as I know there's nothing along the lines of soup.group_by_header(), but for the input you describe, what you want is fairly straightforward to achieve in any case:

>>> from bs4 import BeautifulSoup     
>>> html = """
...     <h2>section 1</h2>
...     <p>para 1</p>
...     <!-- etc. -->
... """
>>> soup = BeautifulSoup(html, "html.parser")
>>> sections = {}
>>> for header in soup("h2"):
...     paras = []
...     for sibling in header.find_next_siblings(True):
...         if sibling.name == "h2":
...             break
...         paras.append(sibling.string)
...     sections[header.string] = paras
... 
>>> sections
{u'section 1': [u'para 1', u'para 2', u'para 3'],
 u'section 2': [u'para 1', u'para 2', u'para 3'],
 u'section 3': [u'para 1', u'para 2', u'para 3']}
>>> 

Is that approach problematic for some reason, or were you just wondering whether there's some clever BeautifulSoup method kicking around that suits (and to be fair, there are a few of those)?
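If pulling in BeautifulSoup is not an option, the same grouping can be sketched with the standard library's html.parser, tracking the current header much as the question suggests. This is only a minimal sketch: SectionGrouper and group_by_h2 are hypothetical names made up for illustration, and it assumes the flat h2/p layout shown in the question. It joins each section's paragraphs into a single string, which is closer to the dictionary shape the question asks for.

```python
from html.parser import HTMLParser


class SectionGrouper(HTMLParser):
    """Collect <p> text under the most recent <h2> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sections = {}    # header text -> list of paragraph strings
        self._tag = None      # element currently being read: "h2", "p", or None
        self._header = None   # text of the most recent <h2>
        self._buffer = []     # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        # Assumption: h2 and p elements are not nested inside each other.
        if tag in ("h2", "p"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Text outside h2/p (such as whitespace between elements) is ignored.
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "h2":
            self._header = text
            self.sections[text] = []
        elif self._header is not None:
            # Paragraphs that appear before the first header are dropped.
            self.sections[self._header].append(text)
        self._tag = None


def group_by_h2(html):
    """Return {header text: paragraphs joined by newlines}."""
    parser = SectionGrouper()
    parser.feed(html)
    parser.close()
    return {header: "\n".join(paras)
            for header, paras in parser.sections.items()}
```

Calling group_by_h2(html) on the sample from the question should give one key per h2, each mapped to its three paragraphs joined by newlines. Swap the join for the raw lists if you would rather keep the paragraphs separate.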

Upvotes: 1

Nat Knight

Reputation: 374

I think you want the built-in str.split method. If the text you've got there is in html_string, you can do:

sections = html_string.split('<h2>')  # split() consumes the '<h2>' delimiter
for section in sections:
    if not section.strip():
        continue  # skip the empty chunk before the first <h2>
    section = '<h2>' + section  # restore the opening h2 tag
    # code to parse each section goes here

That should be much cleaner than using a while loop.
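To make the "parse each section" step concrete, here is one way it could look. This is a rough sketch rather than a robust parser: split_sections is a hypothetical name, and it assumes every header is a bare lowercase `<h2>` with no attributes and every paragraph is a plain, non-nested `<p>...</p>`.

```python
import re


def split_sections(html_string):
    """Group <p> text by the preceding <h2> using plain string splitting."""
    sections = {}
    # split() consumes the '<h2>' delimiter. The first chunk is whatever
    # precedes the first header, so it is skipped.
    for chunk in html_string.split("<h2>")[1:]:
        # Everything before '</h2>' is the header text, the rest is the body.
        header, _, body = chunk.partition("</h2>")
        # Assumption: paragraphs are plain, non-nested <p>...</p> elements.
        paras = re.findall(r"<p>(.*?)</p>", body, flags=re.DOTALL)
        sections[header.strip()] = [p.strip() for p in paras]
    return sections
```

Anything like `<h2 class="...">` or uppercase tags would slip past the split, so for markup you don't control a real HTML parser is the safer choice.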

Upvotes: 0
