Extract content between two tags at same sibling level

Question

I'm trying to collect the content between two tags at the same level, in this case the content between the two h2 tags below:

Learning Outcomes



On successful completion of this unit, you will beable to:





Plan for and be active in your own learning...
Reflect on your knowledge of yourself....
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience




Prior knowledge

Ideally, I would like the output as below (i.e., ideally the introductory line, "On successful completion of this unit...", would be ignored, but I'm ok with it sticking around):

Plan for and be active in your own learning...
Reflect on your knowledge of teaching and yourself...
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience

This is what I have so far:

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, "html.parser")  # text holds the page's HTML
output = ""
unitLO = soup.find(id="learning-outcomes")
if unitLO:
    # read the tag name only after confirming the element was found
    tagBreak = unitLO.name
    # we will loop until we hit the next tag with the same name as the
    # matched tag. eg if unitLO matches an H3, then all content up till the
    # next H3 is captured.
    for tag in unitLO.next_siblings:
        if tag.name == tagBreak:
            break
        else:
            output += str(tag)

print(output)

This gives the following output, which is a string:

>>> type(output)

>>>





On successful completion of this unit, you will beable to:





Plan for and be active in your own learning...
Reflect on your knowledge of yourself....
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience

Which is somewhat different from what I want...

The only solution I've come up with is to push output through another round of BeautifulSoup parsing:

>>> moresoup = BeautifulSoup(output, "html.parser")
>>> for text in moresoup.strings:  # avoid shadowing the built-in str
...     print(text)
...






On successful completion of this unit, you will beableto:












Plan for and be active in your own learning...


Reflect on your knowledge of yourself....


Articulate your informed understanding of the foundations...


Demonstrate information literacy skills


Communicate in writing for an academic audience










>>>

This is really inelegant and produces a lot of whitespace (which is, of course, easy to clean up).

Any thoughts on a more elegant way of doing this?

Many thanks!

Rakesh · Accepted Answer

Try locating the table that follows the heading and then using find_all to get all the p tags inside it.

Example:

from bs4 import BeautifulSoup
s = """Learning Outcomes



On successful completion of this unit, you will beable to:





Plan for and be active in your own learning...
Reflect on your knowledge of yourself....
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience




Prior knowledge"""

soup = BeautifulSoup(s, "html.parser")
for p in soup.find(id="learning-outcomes").find_next("table").find_all("p"):
    print(p.text)

Output:

Plan for and be active in your own learning...
Reflect on your knowledge of yourself....
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience
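
If the content after the heading is not always a table, a more general variant is to walk the heading's siblings until the next tag with the same name and collect the text directly, instead of building an HTML string and re-parsing it. The sketch below is only an illustration: the markup in HTML is a hypothetical stand-in for the real page, and section_strings is a helper name invented here, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: the real page structure is an assumption here.
HTML = """
<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
  <thead><tr><th>On successful completion of this unit, you will be able to:</th></tr></thead>
  <tbody><tr><td>
    <p>Plan for and be active in your own learning...</p>
    <p>Demonstrate information literacy skills</p>
  </td></tr></tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>
<p>Not part of the section.</p>
"""


def section_strings(soup, heading_id):
    """Return the non-empty text fragments between a heading and the next
    sibling tag with the same name (hypothetical helper)."""
    heading = soup.find(id=heading_id)
    if heading is None:
        return []
    fragments = []
    for sibling in heading.next_siblings:
        if not isinstance(sibling, Tag):
            # Bare strings between tags (mostly whitespace here).
            text = sibling.strip()
            if text:
                fragments.append(text)
            continue
        if sibling.name == heading.name:
            break  # reached the next heading at the same level
        # stripped_strings yields each text node with surrounding
        # whitespace removed and skips whitespace-only nodes.
        fragments.extend(sibling.stripped_strings)
    return fragments


soup = BeautifulSoup(HTML, "html.parser")
for line in section_strings(soup, "learning-outcomes"):
    print(line)
```

With this markup the introductory line is still included. To drop it, one option is to collect only the p tags of each sibling (for example with sibling.find_all("p")) rather than every string.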
