Using BeautifulSoup4 to retrieve text between 2 tags at different levels

Question

Here's a snippet of a "real-world" HTML file I'm trying to scrape with BeautifulSoup4 (Python 3) using the xml parser (the other parsers don't work with the kind of dirty HTML files I'm working with):


     Hello 
    Item One
     Text that I would like to scrape. 
     More text I would like to scrape.
        

            
                
                    Item Two
                
            
        
        A bunch of text that shouldn't be scraped.
        More text.
        And more text.

My goal is to scrape all the text sitting between Item One and Item Two without scraping the 3 lines of text that follow Item Two.

I've tried traversing from the first tag with find_next() and calling get_text() on each tag, but when I reach the last container element, the text at the end also gets scraped, which isn't what I want.

Sample code:

tag_one = soup.find('a', {'name': 'One'})
tag_two = soup.find('a', {'name': 'Two'})
found = False
tag = tag_one
while not found:
    tag = tag.find_next()
    if tag == tag_two:
        found = True
    print(tag.get_text())

Any ideas on how to solve this?

nwly · Accepted Answer

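I came up with a more robust way: iterate over next_elements starting at the first tag, print only the text nodes, and stop as soon as the second tag is reached: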

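import bs4
from bs4 import BeautifulSoup

# html holds the markup from the question
soup = BeautifulSoup(html, 'xml')
tag_one = soup.find('a', {'name': 'One'})
tag_two = soup.find('a', {'name': 'Two'})

# Walk every node after tag_one in document order
for tag in tag_one.next_elements:
    # Print only text nodes (NavigableString), not Tag objects
    if not isinstance(tag, bs4.element.Tag):
        print(tag)
    # Stop once the second anchor is reached
    if tag is tag_two:
        break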
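
For anyone who wants to try this without the original file, here is a minimal self-contained sketch of the same next_elements idea. The markup in SAMPLE_HTML is an assumption (the tags in the question's snippet are not shown, so I made up a plausible nesting), text_between is just a hypothetical helper name, and I used html.parser so it runs without lxml; swap in 'xml' if that is what your files need. Unlike the loop above, this version also skips the first tag's own text and collects the strings in a list instead of printing them.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: the real tags from the question are unknown,
# so this nesting is only an assumption for illustration.
SAMPLE_HTML = """
<div>
  <p>Hello</p>
  <a name="One">Item One</a>
  <p>Text that I would like to scrape.</p>
  <p>More text I would like to scrape.</p>
  <div>
    <table><tr><td><a name="Two">Item Two</a></td></tr></table>
    <p>A bunch of text that shouldn't be scraped.</p>
  </div>
</div>
"""


def text_between(start, end):
    """Collect non-blank strings after `start` in document order, stopping at `end`."""
    chunks = []
    for node in start.next_elements:
        # Stop as soon as the end tag itself is reached
        if node is end:
            break
        # Skip anything nested inside the start tag (its own label text)
        if any(parent is start for parent in node.parents):
            continue
        # Keep text nodes only, ignoring whitespace-only strings
        if isinstance(node, NavigableString) and node.strip():
            chunks.append(node.strip())
    return chunks


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
start = soup.find("a", {"name": "One"})
end = soup.find("a", {"name": "Two"})
result = text_between(start, end)
print(result)
```

Because the loop looks at individual text nodes and never calls get_text() on a container, the nesting depth of the second tag should not matter: iteration stops at the tag itself, before any text that follows it.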
