Hazem Salama

Reputation: 15101

How do you use BeautifulSoup to select a tag depending on its children and siblings?

I am trying to extract quotes from the 2012 Obama-Romney presidential debate. The problem is that the site is not well organized, so the structure looks like this:

<span class="displaytext">
    <p>
        <i>OBAMA</i>Obama's first quotes
    </p>
    <p>More quotes from Obama</p>
    <p>Some more Obama quotes</p>

    <p>
        <i>Moderator</i>Moderator's quotes
    </p>
    <p>Some more quotes</p>

    <p>
        <i>ROMNEY</i>Romney's quotes
    </p>
    <p>More quotes from Romney</p>
    <p>Some more Romney quotes</p>
</span>

Is there a way to select a <p> whose first child is an <i> with the text OBAMA, together with all of its <p> siblings, up to the next <p> whose first child is an <i> that does not have the text OBAMA?

Here is what I have tried so far, but it only grabs the first <p> and ignores the siblings:

from bs4 import BeautifulSoup

input = '''<span class="displaytext">
        <p>
            <i>OBAMA</i>Obama's first quotes
        </p>
        <p>More quotes from Obama</p>
        <p>Some more Obama quotes</p>

       <p>
           <i>Moderator</i>Moderator's quotes
       </p>
       <p>Some more quotes</p>

       <p>
           <i>ROMNEY</i>Romney's quotes
       </p>
       <p>More quotes from Romney</p>
       <p>Some more Romney quotes</p>
       </span>'''

soup = BeautifulSoup(input, "html.parser")  # name the parser explicitly
debate_text = soup.find("span", {"class": "displaytext"})
president_quotes = debate_text.find_all("i", string="OBAMA")  # string= supersedes text=

for i in president_quotes:
    siblings = i.next_siblings
    for sibling in siblings:
        print(sibling)

This prints only Obama's first quotes.

Upvotes: 2

Views: 60

Answers (2)

Joey

Reputation: 63

The other Obama quotes are siblings of the <p>, not of the <i>, so you need to walk the siblings of the <i>'s parent. While looping through those siblings, stop at the first one that contains an <i>. Something like this:

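for i in president_quotes:
    # Text that follows the <i> label inside the same <p>
    print(i.next_sibling)
    # Later <p> siblings of the label's parent <p>
    siblings = i.parent.find_next_siblings('p')
    for sibling in siblings:
        # A <p> containing an <i> starts the next speaker
        if sibling.find("i"):
            break
        print(sibling.string)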

which prints:

Obama's first quotes

More quotes from Obama
Some more Obama quotes
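
For reuse, the same sibling walk can be wrapped in a function. The following is only a sketch: `quotes_for` and `SAMPLE` are hypothetical names, the last paragraph of the sample was added to illustrate a speaker who talks more than once, and the code assumes the quote text directly follows the <i> label.

```python
from bs4 import BeautifulSoup

# Shortened sample; the final OBAMA paragraph is an illustrative addition.
SAMPLE = """<span class="displaytext">
<p><i>OBAMA</i>Obama's first quotes</p>
<p>More quotes from Obama</p>
<p><i>ROMNEY</i>Romney's quotes</p>
<p>More quotes from Romney</p>
<p><i>OBAMA</i>A later Obama turn</p>
</span>"""


def quotes_for(container, speaker):
    """Return the text of every <p> spoken by `speaker`, in document order."""
    quotes = []
    for label in container.find_all("i", string=speaker):
        # Assumption: the quote text is the node right after the <i> label.
        quotes.append(str(label.next_sibling or "").strip())
        # Walk the following <p> siblings until another speaker label appears.
        for sibling in label.parent.find_next_siblings("p"):
            if sibling.find("i"):
                break
            quotes.append(sibling.get_text(strip=True))
    return quotes


soup = BeautifulSoup(SAMPLE, "html.parser")
span = soup.find("span", class_="displaytext")
print(quotes_for(span, "OBAMA"))
```

Because `find_all` returns every matching label, a speaker who takes several turns gets all of them collected, not just the first.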

Upvotes: 2

Ilya V. Schurov

Reputation: 8047

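I think a finite-state-machine-style solution will work here: keep a flag that records whether Obama is the current speaker, and flip it whenever a <p> carries an <i> label. Like this: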

soup = BeautifulSoup(input, 'lxml')
debate_text = soup.find("span", { "class" : "displaytext" })
obama_is_on = False
obama_tags = []
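for p in debate_text("p"):  # calling a tag is shorthand for find_all
    # `in` checks the <i> tag's direct children, so this matches <i>OBAMA</i>
    if p.i and 'OBAMA' in p.i:
        # assuming <i> is used only to indicate speaker
        obama_is_on = True
    if p.i and 'OBAMA' not in p.i:
        # another speaker's label: switch off and skip this <p>
        obama_is_on = False
        continue
    if obama_is_on:
        obama_tags.append(p)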
print(obama_tags)

which prints:

[<p>
<i>OBAMA</i>Obama's first quotes
        </p>, <p>More quotes from Obama</p>, <p>Some more Obama quotes</p>]
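
The same one-pass idea extends to every speaker at once if the flag is replaced by a "current speaker" variable. This is a minimal sketch, not part of the original answer: `group_by_speaker` and `SAMPLE` are hypothetical names, and it assumes, as above, that <i> is used only to mark the speaker.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

SAMPLE = """<span class="displaytext">
<p><i>OBAMA</i>Obama's first quotes</p>
<p>More quotes from Obama</p>
<p>Some more Obama quotes</p>
<p><i>Moderator</i>Moderator's quotes</p>
<p>Some more quotes</p>
<p><i>ROMNEY</i>Romney's quotes</p>
<p>More quotes from Romney</p>
<p>Some more Romney quotes</p>
</span>"""


def group_by_speaker(container):
    """Map each speaker label to the list of paragraph texts that follow it."""
    turns = defaultdict(list)
    current = None  # state: whoever spoke last
    for p in container.find_all("p"):
        if p.i is not None:
            # A labelled <p> switches the state to a new speaker.
            current = p.i.get_text(strip=True)
        if current is None:
            continue  # text before any label has no known speaker
        # Direct text children only, so the <i> label itself is left out.
        text = "".join(p.find_all(string=True, recursive=False)).strip()
        if text:
            turns[current].append(text)
    return dict(turns)


soup = BeautifulSoup(SAMPLE, "html.parser")
span = soup.find("span", class_="displaytext")
grouped = group_by_speaker(span)
print(grouped["OBAMA"])
```

Keeping the speaker name as the state means the loop never needs to know the names in advance, which helps if the transcript contains labels that were not anticipated.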

Upvotes: 2
