Reputation: 1
I'm attempting to parse a very extensive HTML document that looks something like:
<div class="reportsubsection n" >
<h2> part 1 </h2>
<p> insert text here </p>
<table> crazy table thing here </table>
</div>
<div class="reportsubsection n">
<h2> part 2 </h2>
<p> insert text here </p>
<table> crazy table thing here </table>
</div>
I need to parse out the second div based on its h2 having the text "part 2". I was able to break out all divs with:
divTag = soup.find("div", {"id": "reportsubsection"})
but didn't know how to narrow it down from there. In other posts I found, I was able to find the specific text "part 2", but I need to be able to output the whole div section it is contained in.
EDIT/UPDATE
OK, sorry, but I'm still a little lost. Here is what I've got now. I feel like this should be so much simpler than I'm making it. Thanks again for all the help.
divTag = soup.find("div", {"id": "reportsubsection"})
for reportsubsection in soup.select('div#reportsubsection #reportsubsection'):
    if not reportsubsection.findAll('h2', text=re.compile('Finding')):
        continue
    print divTag
Upvotes: 0
Views: 3840
Reputation: 1122282
You can always go back up after finding the right h2, or you can test all subsections:
for subsection in soup.select('div#reportsubsection #subsection'):
    if not subsection.find('h2', text=re.compile('part 2')):
        continue
    # do something with this subsection
This uses a CSS selector to locate all subsections.
Or, going back up with the .parent attribute:
for header in soup.find_all('h2', text=re.compile('part 2')):
    section = header.parent
The trick is to narrow down your search as early as possible; the second option has to find all h2 elements in the whole document, while the first narrows the search down sooner.
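For reference, here is a minimal, self-contained sketch of both approaches run against markup shaped like the sample in the question. It rests on a few assumptions: reportsubsection is a class (as in the sample HTML), so the selector is div.reportsubsection rather than an id selector; the table contents are placeholders; and string= is used as the newer spelling of the text= argument in recent BeautifulSoup 4 releases.

```python
import re
from bs4 import BeautifulSoup

# Placeholder markup modelled on the question's sample.
# Assumption: "reportsubsection" is a class here, not an id.
html = """
<div class="reportsubsection n">
<h2> part 1 </h2>
<p> insert text here </p>
<table><tr><td>table 1</td></tr></table>
</div>
<div class="reportsubsection n">
<h2> part 2 </h2>
<p> insert text here </p>
<table><tr><td>table 2</td></tr></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
wanted = re.compile(r"part 2", re.IGNORECASE)

# Option 1: select the candidate divs by class, then test each one's h2.
by_selector = [
    div for div in soup.select("div.reportsubsection")
    if div.find("h2", string=wanted)
]

# Option 2: find the matching h2 first, then climb to its enclosing div.
by_parent = [
    h2.find_parent("div", class_="reportsubsection")
    for h2 in soup.find_all("h2", string=wanted)
]

for section in by_selector:
    print(section)  # the whole div, including its h2, p and table
```

Both lists end up holding the same div element. find_parent is used instead of .parent so the lookup would still work if the h2 were nested more deeply inside the div.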
Upvotes: 2