Reputation: 171
I'm trying to collect the content between two tags at the same level, in this case the content between the two h2
tags below:
<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will beable to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>
Ideally, I would like the output below (preferably without the text in the <th>,
but I'm OK with it sticking around):
Plan for and be active in your own learning...
Reflect on your knowledge of teaching and yourself...
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience
This is what I have so far:
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, "html.parser")
output = ""
unitLO = soup.find(id="learning-outcomes")
if unitLO:
    # read the tag name only after the None check; find() returns None on no match
    tagBreak = unitLO.name
    # we will loop until we hit the next tag with the same name as the
    # matched tag. eg if unitLO matches an H3, then all content up till the
    # next H3 is captured.
    for tag in unitLO.next_siblings:
        if tag.name == tagBreak:
            break
        else:
            output += str(tag)
print(output)
This gives the following output, which is a string:
>>> type(output)
<class 'str'>
>>>
<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will beable to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>
Which is somewhat different from what I want...
The only solution I've come up with is to push output through another round of BeautifulSoup parsing:
>>> moresoup = BeautifulSoup(output, "html.parser")
>>> for s in moresoup.strings:  # "s" rather than "str", which would shadow the builtin
...     print(s)
...
On successful completion of this unit, you will beable to:
Plan for and be active in your own learning...
Reflect on your knowledge of yourself....
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience
>>>
This is really inelegant, and it leaves a lot of whitespace (which of course is easy to clean up).
Any thoughts on a more elegant way of doing this?
Many thanks!
Upvotes: 0
Views: 698
Reputation: 4992
Change the loop as follows so that whitespace-only siblings are skipped:
if unitLO:
    # we will loop until we hit the next tag with the same name as the
    # matched tag. eg if unitLO matches an H3, then all content up till the
    # next H3 is captured.
    for tag in unitLO.next_siblings:
        if tag.name == tagBreak:
            break
        else:
            if str(tag).strip() != "":
                output += str(tag)
print(output)
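Note that skipping blank siblings only trims the whitespace; the result is still markup. A minimal sketch of one way to get plain text from the same loop, without a second parse, is to read stripped_strings from each sibling tag instead of calling str(tag). The helper name text_until_next and the small HTML sample are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def text_until_next(start):
    """Collect stripped text after `start` until a sibling tag with the same name."""
    lines = []
    for sib in start.next_siblings:
        if not isinstance(sib, Tag):
            # Bare text node between tags: keep it only if it is not blank.
            text = str(sib).strip()
            if text:
                lines.append(text)
            continue
        if sib.name == start.name:
            break  # reached the next heading of the same level
        # stripped_strings yields each text fragment with whitespace removed
        lines.extend(sib.stripped_strings)
    return lines


# Hypothetical sample, shaped like the question's HTML but shortened.
html = """<h2 id="a">A</h2>
<table><tr><th>Head</th></tr><tr><td><p>One</p><p>Two</p></td></tr></table>
<h2 id="b">B</h2>
<p>Three</p>"""
soup = BeautifulSoup(html, "html.parser")
result = text_until_next(soup.find(id="a"))
print(result)
```

This keeps the <th> text as well, which the question says is acceptable.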
Upvotes: 0
Reputation: 82765
Try using soup.find_all to get all the p tags.
Example:
from bs4 import BeautifulSoup
s = """<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will beable to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>"""
soup = BeautifulSoup(s, "html.parser")
# find_next is the current name for the older findNext alias
for p in soup.find(id="learning-outcomes").find_next("table").find_all("p"):
    print(p.text)
Output:
Plan for and be active in your own learning...
Reflect on your knowledge of yourself....
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience
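One caveat: find_next("table") looks through the rest of the document, so if a section has no table of its own it would pick up a table from a later section. A rough sketch of a variant that stops at the next heading with the same tag name is below. The function name paragraphs_in_section is hypothetical, and the "None required" paragraph in the sample is invented purely to show the boundary.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def paragraphs_in_section(soup, heading_id):
    """Return the text of every <p> between a heading and the next same-level heading."""
    heading = soup.find(id=heading_id)
    if heading is None:
        return []  # no such id in the document
    items = []
    for sib in heading.next_siblings:
        if not isinstance(sib, Tag):
            continue  # skip whitespace and other bare strings
        if sib.name == heading.name:
            break  # next section starts here
        if sib.name == "p":
            # the sibling itself is a paragraph
            items.append(sib.get_text(strip=True))
        # paragraphs nested anywhere inside the sibling (e.g. in a table)
        items.extend(p.get_text(strip=True) for p in sib.find_all("p"))
    return items


html = """<h2 id="learning-outcomes">Learning Outcomes</h2>
<table><tbody><tr><td><ol>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td></tr></tbody></table>
<h2 id="prior-knowledge">Prior knowledge</h2>
<p>None required</p>"""
soup = BeautifulSoup(html, "html.parser")
outcomes = paragraphs_in_section(soup, "learning-outcomes")
for line in outcomes:
    print(line)
```

Because only <p> elements are collected, the <th> text is dropped, which matches the output the question asks for.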
Upvotes: 2