Reputation: 15101
I am trying to extract quotes from the 2012 Obama-Romney presidential debate. The problem is that the site is not well organized. The structure looks like this:
<span class="displaytext">
<p>
<i>OBAMA</i>Obama's first quotes
</p>
<p>More quotes from Obama</p>
<p>Some more Obama quotes</p>
<p>
<i>Moderator</i>Moderator's quotes
</p>
<p>Some more quotes</p>
<p>
<i>ROMNEY</i>Romney's quotes
</p>
<p>More quotes from Romney</p>
<p>Some more Romney quotes</p>
</span>
Is there a way to select a <p> whose first child is an <i> with the text OBAMA, AND all its <p> siblings, UNTIL you hit the next <p> whose first child is an <i> that does not have the text OBAMA?
Here is what I have tried so far, but it only grabs the first <p> and ignores the siblings:
from bs4 import BeautifulSoup

input = '''<span class="displaytext">
<p>
<i>OBAMA</i>Obama's first quotes
</p>
<p>More quotes from Obama</p>
<p>Some more Obama quotes</p>
<p>
<i>Moderator</i>Moderator's quotes
</p>
<p>Some more quotes</p>
<p>
<i>ROMNEY</i>Romney's quotes
</p>
<p>More quotes from Romney</p>
<p>Some more Romney quotes</p>
</span>'''
soup = BeautifulSoup(input)
debate_text = soup.find("span", { "class" : "displaytext" })
president_quotes = debate_text.find_all("i", text="OBAMA")
for i in president_quotes:
    siblings = i.next_siblings
    for sibling in siblings:
        print(sibling)
This prints only "Obama's first quotes".
Upvotes: 2
Views: 60
Reputation: 63
The other Obama quotes are siblings of the <p>, not of the <i>, so you'll need to find the siblings of the <i>'s parent. As you loop through those siblings, you can stop as soon as one contains an <i>. Something like this:
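for i in president_quotes:
    # the text right after the <i> inside the first <p>
    print(i.next_sibling.strip())
    # the following <p> tags, up to the next one that names a speaker
    siblings = i.parent.find_next_siblings('p')
    for sibling in siblings:
        if sibling.find("i"):
            break
        print(sibling.string)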
which prints:
Obama's first quotes
More quotes from Obama
Some more Obama quotes
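The same sibling walk can be wrapped in a small helper that returns the quotes instead of printing them. This is only a sketch under a few assumptions: the function name `quotes_after` is made up for illustration, an `<i>` is assumed to appear only as a speaker label, and Python's built-in `html.parser` backend is used so no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

html = '''<span class="displaytext">
<p>
<i>OBAMA</i>Obama's first quotes
</p>
<p>More quotes from Obama</p>
<p>Some more Obama quotes</p>
<p>
<i>Moderator</i>Moderator's quotes
</p>
<p>Some more quotes</p>
<p>
<i>ROMNEY</i>Romney's quotes
</p>
<p>More quotes from Romney</p>
<p>Some more Romney quotes</p>
</span>'''


def quotes_after(speaker_tag):
    """Hypothetical helper: collect the text that follows one speaker <i> tag."""
    quotes = []
    # Text sitting right after the <i> inside the same <p>, if there is any.
    first = speaker_tag.next_sibling
    if isinstance(first, str) and first.strip():
        quotes.append(first.strip())
    # Following <p> siblings belong to this speaker until another <i> shows up.
    for p in speaker_tag.parent.find_next_siblings("p"):
        if p.find("i"):
            break
        quotes.append(p.get_text(strip=True))
    return quotes


soup = BeautifulSoup(html, "html.parser")
obama_quotes = []
for tag in soup.find_all("i", string="OBAMA"):
    obama_quotes.extend(quotes_after(tag))
print(obama_quotes)
```

Because the loop runs over every `<i>` whose text is OBAMA, later Obama turns in a longer transcript would be appended to the same list.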
Upvotes: 2
Reputation: 8047
I think a finite-state-machine-like solution will work here: walk over the <p> tags in order and keep a flag that records whether Obama is the current speaker. Like this:
soup = BeautifulSoup(input, 'lxml')
debate_text = soup.find("span", { "class" : "displaytext" })
obama_is_on = False
obama_tags = []
for p in debate_text("p"):
    if p.i and 'OBAMA' in p.i:
        # assuming <i> is used only to indicate speaker
        obama_is_on = True
    if p.i and 'OBAMA' not in p.i:
        obama_is_on = False
        continue
    if obama_is_on:
        obama_tags.append(p)
print(obama_tags)
which prints:
[<p>
<i>OBAMA</i>Obama's first quotes
</p>, <p>More quotes from Obama</p>, <p>Some more Obama quotes</p>]
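The state-machine idea extends naturally from a single on/off flag to a "current speaker" variable, which groups the text of every speaker in one pass. The following is a rough sketch rather than a definitive implementation: `group_by_speaker` is a hypothetical name, `html.parser` is swapped in for `lxml` to avoid the extra dependency, and it assumes that an unlabeled `<p>` belongs to whoever was labeled most recently.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

html = '''<span class="displaytext">
<p>
<i>OBAMA</i>Obama's first quotes
</p>
<p>More quotes from Obama</p>
<p>Some more Obama quotes</p>
<p>
<i>Moderator</i>Moderator's quotes
</p>
<p>Some more quotes</p>
<p>
<i>ROMNEY</i>Romney's quotes
</p>
<p>More quotes from Romney</p>
<p>Some more Romney quotes</p>
</span>'''


def group_by_speaker(container):
    """Hypothetical helper: map each speaker label to a list of quote strings."""
    quotes = defaultdict(list)
    current = None  # the state: who is speaking right now
    for p in container.find_all("p"):
        if p.i is not None:
            # An <i> is assumed to mark a change of speaker.
            current = p.i.get_text(strip=True)
        if current is None:
            continue  # text that appears before any speaker label
        # Keep only the direct text of the <p>, which skips the <i> label.
        text = "".join(c for c in p.children if isinstance(c, str)).strip()
        if text:
            quotes[current].append(text)
    return dict(quotes)


soup = BeautifulSoup(html, "html.parser")
by_speaker = group_by_speaker(soup.find("span", class_="displaytext"))
print(by_speaker["OBAMA"])
```

Note that this attributes "Some more quotes" to the moderator, whereas the flag-based loop above simply ignores anything that is not Obama's.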
Upvotes: 2