Reputation: 23104
I have an HTML file similar to the following:
<h2>section 1</h2>
<p>para 1</p>
<p>para 2</p>
<p>para 3</p>
<h2>section 2</h2>
<p>para 1</p>
<p>para 2</p>
<p>para 3</p>
<h2>section 3</h2>
<p>para 1</p>
<p>para 2</p>
<p>para 3</p>
I would like to scrape those into a Python dictionary: {'section1':'...', 'section2':'...', 'section3':'...'}. Of course I can set a current_section variable and use a while loop, but is there a module for this purpose?
I have checked out BeautifulSoup but didn't find a shortcut there.
Thanks!
Upvotes: 2
Views: 458
Reputation: 59128
As far as I know there's nothing along the lines of soup.group_by_header(), but for the input you describe, what you want is fairly straightforward to achieve in any case:
>>> from bs4 import BeautifulSoup
>>> html = """
... <h2>section 1</h2>
... <p>para 1</p>
... <!-- etc. -->
... """
>>> soup = BeautifulSoup(html)
>>> sections = {}
>>> for header in soup("h2"):
...     paras = []
...     for sibling in header.find_next_siblings(text=False):
...         if sibling.name == "h2":
...             break
...         paras.append(sibling.string)
...     sections[header.string] = paras
...
>>> sections
{u'section 1': [u'para 1', u'para 2', u'para 3'],
u'section 2': [u'para 1', u'para 2', u'para 3'],
u'section 3': [u'para 1', u'para 2', u'para 3']}
>>>
Is that approach problematic for some reason, or were you just wondering whether there's some clever BeautifulSoup method kicking around that suits (and to be fair, there are a few of those)?
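For reference, here is a minimal sketch of the same loop wrapped in a function that builds the dictionary of strings described in the question, rather than lists of paragraphs. The function name `sections_by_header`, the `header_tag` parameter, and the choice to join paragraphs with newlines are assumptions made for illustration, not part of BeautifulSoup; the sketch also assumes the headers and paragraphs are siblings, as in the sample input.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def sections_by_header(html, header_tag="h2"):
    """Map each header's text to the text of the elements that follow it.

    Hypothetical helper: assumes headers and their content are siblings.
    """
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for header in soup.find_all(header_tag):
        texts = []
        for sibling in header.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip whitespace, comments and other non-tag nodes
            if sibling.name == header_tag:
                break  # the next header starts a new section
            texts.append(sibling.get_text())
        # Joining with newlines is an arbitrary choice for this sketch.
        sections[header.get_text()] = "\n".join(texts)
    return sections
```

Calling `sections_by_header(html)` on input shaped like the sample should give one key per `<h2>`, with the paragraph text of that section as the value. Note that two headers with identical text would overwrite each other in the dictionary.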
Upvotes: 1
Reputation: 374
I think you want the built-in string type's split method. If the text you've got there is in html_string, you can do:
sections = html_string.split('<h2>')  # this deletes the opening h2 tag
for section in sections[1:]:  # sections[0] is whatever precedes the first <h2>
    section = '<h2>' + section  # restore the opening h2 tag
    # code to parse each section goes here
That should be much cleaner than using a while loop.
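As a rough sketch of how the per-section parsing step could be filled in with plain string methods: the function name `split_sections` is hypothetical, and the approach assumes every header is written exactly as `<h2>...</h2>` (lowercase, no attributes). The section body is left as raw HTML, so any further tag stripping is still up to the caller.

```python
def split_sections(html_string):
    """Split HTML on literal '<h2>' markers into a {title: body} dict.

    Hypothetical helper: only works when headers are written exactly
    as '<h2>' with no attributes.
    """
    sections = {}
    # The first chunk is whatever precedes the first <h2>, so skip it.
    for chunk in html_string.split("<h2>")[1:]:
        # Everything before '</h2>' is the title, the rest is the body.
        title, _, body = chunk.partition("</h2>")
        sections[title.strip()] = body.strip()
    return sections
```

This avoids a third-party dependency, but it is more fragile than a real parser: a header such as `<h2 class="x">` or `<H2>` would not be recognised as a section boundary.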
Upvotes: 0