Alek SZ

Reputation: 165

BeautifulSoup - How to get all text between two different tags?

I would like to get all text between two tags:

<div class="lead">I DONT WANT this</div>

#many different tags - p, table, h2 including text that I want

<div class="image">...</div>

I started this way:

import urllib.request
from bs4 import BeautifulSoup

url = "http://......."
req = urllib.request.Request(url)
source = urllib.request.urlopen(req)
soup = BeautifulSoup(source, 'lxml')

start = soup.find('div', {'class': 'lead'})
end = soup.find('div', {'class': 'image'})

I have no idea what to do next.

Upvotes: 8

Views: 2938

Answers (2)

matsbauer

Reputation: 434

Try this code. It starts at the div with class lead, iterates over its following siblings, and stops when it reaches the div with class image. It prints every tag found in between; this can be adapted to print the entire markup:

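# Assumes `soup` from the question; look up the end marker once, outside the loop.
end = soup.find("div", {"class": "image"})
html = ""
for tag in soup.find("div", {"class": "lead"}).next_siblings:
    if tag is end:
        break
    html += str(tag)  # Python 3: str() replaces unicode()
print(html)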
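
Since the question asks for the text rather than the markup, here is a minimal sketch of the same sibling loop that keeps only the text. The SAMPLE string is a made-up stand-in for the real page, and "html.parser" is used only so the example runs without lxml; both are assumptions, not part of the original question.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup standing in for the page in the question.
SAMPLE = (
    '<div class="lead">I DONT WANT this</div>'
    '<p>first paragraph</p>'
    'loose text'
    '<h2>heading</h2>'
    '<div class="image">...</div>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")
start = soup.find("div", class_="lead")
end = soup.find("div", class_="image")

texts = []
for node in start.next_siblings:
    if node is end:
        break  # reached the closing marker
    if isinstance(node, Tag):
        # Tags: flatten all nested text into one string.
        text = node.get_text(" ", strip=True)
    else:
        # Bare text nodes between the tags.
        text = str(node).strip()
    if text:
        texts.append(text)

print(texts)
```

Joining the list with "\n".join(texts) would give a single string if that is more convenient.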

Upvotes: 1

herokingsley

Reputation: 403

Try using the code below:

from bs4 import BeautifulSoup

soup = BeautifulSoup("""
    <html><div class="lead">lead</div>data<div class="end"></div></html>
    """, "lxml")

node = soup.find('div', {'class': 'lead'})
s = []
while node is not None:
    node = node.next_sibling
    if node is None:
        break
    # .get() avoids a KeyError on tags that have no class attribute.
    if hasattr(node, "attrs") and "end" in node.attrs.get('class', []):
        break
    s.append(node)
print(s)

This uses next_sibling to step through each following sibling node.
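
One caveat: next_sibling only sees nodes that share a parent with the starting div. If the end marker might be nested deeper, a possible alternative is to walk next_elements in document order and collect the strings. This is only a sketch under assumptions: text_between is a hypothetical helper name, the SAMPLE markup is invented, and "html.parser" is used so it runs without lxml.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup where the end marker is nested inside another tag.
SAMPLE = (
    '<div class="lead">skip me</div>'
    '<p>first</p>'
    '<section><h2>second</h2><div class="image">stop</div></section>'
)

def text_between(start, end):
    """Collect non-empty strings that appear after `start` and before `end`."""
    texts = []
    for el in start.next_elements:
        if el is end:
            break
        if not isinstance(el, NavigableString):
            continue  # tags themselves carry no text; their strings follow
        if any(parent is start for parent in el.parents):
            continue  # skip text that belongs to the start tag itself
        text = el.strip()
        if text:
            texts.append(text)
    return texts

soup = BeautifulSoup(SAMPLE, "html.parser")
start = soup.find("div", class_="lead")
end = soup.find("div", class_="image")
found = text_between(start, end)
print(found)
```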

Upvotes: 0
