nwly

Reputation: 1511

Using BeautifulSoup4 to retrieve text between 2 tags at different levels

Here's a snippet of a "real-world" HTML file I'm trying to scrape with BeautifulSoup4 (Python 3) using the xml parser (the other parsers don't cope with the kind of dirty HTML files I'm working with):

<html>
    <p> Hello </p>
    <a name='One'>Item One</a>
    <p> Text that I would like to scrape. </p>
    <p> More text I would like to scrape.
        <table>
            <tr>
                <td>
                    <a name='Two'>Item Two</a>
                </td>
            </tr>
        </table>
        A bunch of text that shouldn't be scraped.
        More text.
        And more text.
    </p>
</html>

My goal is to scrape all the text sitting between <a name='One'>Item One</a> and <a name='Two'>Item Two</a> without scraping the 3 lines of text in the last <p>.

I've tried traversing from the first <a> tag with find_next() and calling get_text() on each tag, but when I reach the last <p>, the text at the end of it also gets scraped, which isn't what I want.

Sample code:

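tag_one = soup.find('a', {'name': 'One'})
tag_two = soup.find('a', {'name': 'Two'})
found = False
tag = tag_one
while not found:
    # find_next() returns the next tag in document order
    tag = tag.find_next()
    if tag == tag_two:
        found = True
    # get_text() includes the text of every descendant of the tag
    print(tag.get_text())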

Any ideas on how to solve this?

Upvotes: 0

Views: 108

Answers (2)

nwly

Reputation: 1511

I came up with a more robust way:

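import bs4
from bs4 import BeautifulSoup

# html holds the markup from the question
soup = BeautifulSoup(html, 'xml')
tag_one = soup.find('a', {'name': 'One'})
tag_two = soup.find('a', {'name': 'Two'})

# next_elements yields every following tag and string in document order;
# only the strings are printed, and the loop stops at the second anchor
for tag in tag_one.next_elements:
    if not isinstance(tag, bs4.element.Tag):
        print(tag)
    if tag is tag_two:
        break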
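
If it helps, the same next_elements walk can be wrapped in a small helper that collects the strings instead of printing them. This is only a sketch under a few assumptions of mine: the name text_between is made up for illustration, it skips the text inside the start tag and any whitespace-only strings, and it uses the built-in html.parser so it runs without lxml (swap in 'xml' for files like the ones in the question).

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Markup copied from the question.
HTML = """<html>
    <p> Hello </p>
    <a name='One'>Item One</a>
    <p> Text that I would like to scrape. </p>
    <p> More text I would like to scrape.
        <table>
            <tr>
                <td>
                    <a name='Two'>Item Two</a>
                </td>
            </tr>
        </table>
        A bunch of text that shouldn't be scraped.
        More text.
        And more text.
    </p>
</html>"""


def text_between(start, end):
    """Collect non-empty strings found after `start` and before `end` (hypothetical helper)."""
    pieces = []
    for element in start.next_elements:
        if element is end:
            break  # stop as soon as the closing marker tag is reached
        if not isinstance(element, NavigableString):
            continue  # tags themselves carry no text of their own
        if element.parent is start:
            continue  # skip the start tag's own text ("Item One")
        text = element.strip()
        if text:
            pieces.append(text)
    return pieces


# "html.parser" is an assumption made so the sketch has no lxml dependency.
soup = BeautifulSoup(HTML, "html.parser")
tag_one = soup.find("a", {"name": "One"})
tag_two = soup.find("a", {"name": "Two"})
result = text_between(tag_one, tag_two)
print(result)
```

Because the walk stops at the second anchor, the three trailing lines of the last <p> are never visited.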

Upvotes: 1

t.m.adam

Reputation: 15376

You could use the find_all_next method to iterate over the tags that follow the first anchor, and get a list of strings for each tag with its stripped_strings generator.

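from bs4 import BeautifulSoup

# html holds the markup from the question
soup = BeautifulSoup(html, 'xml')
tag_one = soup.find('a', {'name': 'One'})
tag_two = soup.find('a', {'name': 'Two'})
text = None

for tag in tag_one.find_all_next():
    if tag is tag_two:
        break
    # stripped_strings yields the whitespace-stripped strings of the tag and its descendants
    strings = list(tag.stripped_strings)
    # keep only the first string, and skip it if it was just printed
    if strings and strings[0] != text:
        text = strings[0]
        print(text)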
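
To see why the loop keeps only strings[0] and compares it with the previous value, it may help to look at what stripped_strings returns for a tag with nested children. A small sketch, assuming a cut-down version of the last <p> from the question and the built-in html.parser (chosen here only so it runs without lxml):

```python
from bs4 import BeautifulSoup

# Cut-down version of the last <p> from the question (illustrative only).
HTML = (
    "<p> More text I would like to scrape. "
    "<table><tr><td><a name='Two'>Item Two</a></td></tr></table> "
    "A bunch of text that shouldn't be scraped. </p>"
)

soup = BeautifulSoup(HTML, "html.parser")
p = soup.find("p")

# stripped_strings walks all descendants, so nested and trailing text show up too.
p_strings = list(p.stripped_strings)
# find_all_next() with no arguments returns every following tag, nested ones included.
following = [tag.name for tag in p.find_all_next()]
# Each wrapper around the inner <a> starts with the same first string.
first_strings = [next(tag.stripped_strings) for tag in p.find_all_next()]

print(p_strings)
print(following)
print(first_strings)
```

Since a parent and its first child can share the same first string, the text variable in the loop is what keeps the same line from being printed several times. It is worth checking the output against your own data, though: a wrapper such as <table> is visited before the anchor it contains, so its first string can be printed before the loop reaches tag_two.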

Upvotes: 1
