Reputation: 1511
Here's a snippet of a "real-world" HTML file I'm trying to scrape with BeautifulSoup4 (Python 3) using the xml
parser (the other parsers don't work with the kind of dirty html files I'm working with):
<html>
<p> Hello </p>
<a name='One'>Item One</a>
<p> Text that I would like to scrape. </p>
<p> More text I would like to scrape.
<table>
<tr>
<td>
<a name='Two'>Item Two</a>
</td>
</tr>
</table>
A bunch of text that shouldn't be scraped.
More text.
And more text.
</p>
</html>
My goal is to scrape all the text sitting between <a name='One'>Item One</a> and <a name='Two'>Item Two</a> without scraping the 3 lines of text in the last <p>.
I've tried traversing from the first <a> tag using the find_next() function and then invoking get_text(), but when I reach the last <p>, the text at the end also gets scraped, which isn't what I want.
Sample code:
tag_one = soup.find('a', {'name': 'One'})
tag_two = soup.find('a', {'name': 'Two'})
found = False
tag = tag_one
while found == False:
    tag = tag.find_next()
    if tag == tag_two:
        found = True
    print(tag.get_text())
Any ideas on how to solve this?
Upvotes: 0
Views: 108
Reputation: 1511
I came up with a more robust way:
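import bs4
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'xml')
tag_one = soup.find('a', {'name': 'One'})
tag_two = soup.find('a', {'name': 'Two'})
# Walk every node (tags and strings) after tag_one in document order.
for tag in tag_one.next_elements:
    # Print text nodes only, i.e. anything that is not a Tag.
    if type(tag) is not bs4.element.Tag:
        print(tag)
    # Stop once the second anchor is reached.
    if tag is tag_two:
        break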
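The same idea can be wrapped in a small helper that returns the text instead of printing it. This is only a sketch: the function name text_between is made up for illustration, and it uses the built-in "html.parser" so that it runs without lxml (pass 'xml' instead to match the setup in the question). Unlike the loop above, it skips the start tag's own text ("Item One") and whitespace-only strings.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# The sample markup from the question.
HTML = """<html>
<p> Hello </p>
<a name='One'>Item One</a>
<p> Text that I would like to scrape. </p>
<p> More text I would like to scrape.
<table>
<tr>
<td>
<a name='Two'>Item Two</a>
</td>
</tr>
</table>
A bunch of text that shouldn't be scraped.
More text.
And more text.
</p>
</html>"""


def text_between(start, end):
    """Return the non-empty text nodes found after `start` and before `end`.

    Hypothetical helper name; `start` and `end` are bs4 Tag objects.
    """
    chunks = []
    for node in start.next_elements:
        if node is end:
            break  # reached the second anchor, stop collecting
        # Keep strings only, and ignore the start tag's own text.
        if isinstance(node, NavigableString) and node.parent is not start:
            stripped = node.strip()
            if stripped:
                chunks.append(stripped)
    return chunks


# Assumption: "html.parser" is used here only to avoid the lxml dependency.
soup = BeautifulSoup(HTML, "html.parser")
tag_one = soup.find("a", {"name": "One"})
tag_two = soup.find("a", {"name": "Two"})
result = text_between(tag_one, tag_two)
print(result)
```

Because next_elements walks the tree in document order, the loop never reaches the strings that follow the table, so the three trailing lines are left out.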
Upvotes: 1
Reputation: 15376
You could use the find_all_next method to iterate over the tags that follow, and get a list of strings for each tag with the stripped_strings generator.
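from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'xml')
tag_one = soup.find('a', {'name': 'One'})
tag_two = soup.find('a', {'name': 'Two'})
text = None
for tag in tag_one.find_all_next():
    if tag is tag_two:
        break
    strings = list(tag.stripped_strings)
    # Only print a tag's first string, and skip it if it was just printed.
    # Note: ancestors of tag_two (here the <table>) have 'Item Two' as their
    # first string, so that string is printed as well.
    if strings and strings[0] != text:
        text = strings[0]
        print(text)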
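To see why only the first string of each tag is taken, it may help to compare it with get_text() on the enclosing <p>. The snippet below is a rough sketch on a condensed, made-up version of the question's markup (the "Wanted." and "Unwanted." strings are placeholders), parsed with "html.parser" as an assumption so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Condensed stand-in for the question's markup (placeholder text).
HTML = (
    "<html><a name='One'>Item One</a>"
    "<p>Wanted."
    "<table><tr><td><a name='Two'>Item Two</a></td></tr></table>"
    "Unwanted.</p></html>"
)

soup = BeautifulSoup(HTML, "html.parser")
tag_one = soup.find("a", {"name": "One"})
enclosing_p = tag_one.find_next("p")

# get_text() joins every descendant string of the <p>,
# including the text that comes after the nested table.
full_text = enclosing_p.get_text()

# The first item of stripped_strings is just the leading text node.
first_string = next(enclosing_p.stripped_strings)

print(repr(full_text))
print(repr(first_string))
```

This is also what goes wrong in the question's loop: once find_next() lands on the <p> that wraps the table, get_text() returns everything inside it, trailing lines included.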
Upvotes: 1