JohnJ

Reputation: 1

Python/Beautiful Soup find particular heading output full div

I'm attempting to parse a very extensive HTML document that looks something like this:

<div class="reportsubsection n">
   <h2> part 1 </h2>
   <p> insert text here </p>
   <table> crazy table thing here </table>
</div>
<div class="reportsubsection n">
   <h2> part 2 </h2>
   <p> insert text here </p>
   <table> crazy table thing here </table>
</div>

I need to parse out the second div, based on its h2 having the text "Part 2". I was able to break out all the divs with:

divTag = soup.find("div", {"id": "reportsubsection"})

but didn't know how to narrow it down from there. In other posts I found, I was able to find the specific text "part 2", but I need to be able to output the whole div section it is contained in.

EDIT/UPDATE

OK, sorry, but I'm still a little lost. Here is what I've got now. I feel like this should be so much simpler than I'm making it. Thanks again for all the help.

divTag = soup.find("div", {"id": "reportsubsection"})
for reportsubsection in soup.select('div#reportsubsection #reportsubsection'):
    if not reportsubsection.findAll('h2', text=re.compile('Finding')):
        continue
print divTag

Upvotes: 0

Views: 3840

Answers (1)

Martijn Pieters

Reputation: 1122282

You can always go back up after finding the right h2, or you can test all subsections:

import re

# The sample markup uses class="reportsubsection n", so select by class rather than by id.
for subsection in soup.select('div.reportsubsection'):
    if not subsection.find('h2', text=re.compile('part 2')):
        continue
    # do something with this subsection

This uses a CSS selector to locate all subsections.
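As a self-contained illustration of the subsection test, here is a minimal sketch run against a trimmed copy of the markup from the question. It assumes the divs carry `class="reportsubsection n"` as shown there, so it selects by class; the `html` string and the `matches` list are made up for the example, and `string=` is the newer name for the `text=` argument (Beautiful Soup 4.4.0 and later).

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the question's sample markup (illustrative only).
html = """
<div class="reportsubsection n">
   <h2> part 1 </h2>
   <p> insert text here </p>
</div>
<div class="reportsubsection n">
   <h2> part 2 </h2>
   <p> insert text here </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep every subsection div whose h2 text matches, ignoring case.
matches = []
for subsection in soup.select("div.reportsubsection"):
    if subsection.find("h2", string=re.compile(r"part 2", re.I)):
        matches.append(subsection)

# Printing the tag outputs the whole div, including its h2, p and any table.
for match in matches:
    print(match)
```

The `re.I` flag is there only because the question mentions "Part 2" while the sample markup has "part 2"; drop it if the case is known.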

Or, going back up with the .parent attribute:

for header in soup.find_all('h2', text=re.compile('part 2')):
    section = header.parent

The trick is to narrow down your search as early as possible: the second option has to find every h2 element in the whole document, while the first only looks for an h2 inside the subsection divs.
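For the going-back-up variant, a hedged sketch along the same lines: `.parent` is enough when the h2 is a direct child of the div, as in the question's sample, while `find_parent()` also copes if the heading turns out to be wrapped in another element. The `html` string and the `sections` list are illustrative, and the class name is taken from the question's markup.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the question's sample markup (illustrative only).
html = """
<div class="reportsubsection n">
   <h2> part 1 </h2>
   <p> insert text here </p>
</div>
<div class="reportsubsection n">
   <h2> part 2 </h2>
   <p> insert text here </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

sections = []
for header in soup.find_all("h2", string=re.compile(r"part 2", re.I)):
    # Walk up to the nearest enclosing div that has the subsection class.
    section = header.find_parent("div", class_="reportsubsection")
    if section is not None:
        sections.append(section)

# Each entry is the full div that contains the matching heading.
for section in sections:
    print(section)
```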

Upvotes: 2
