Reputation: 165
I would like to get all text between two tags:
<div class="lead">I DONT WANT this</div>
#many different tags - p, table, h2 including text that I want
<div class="image">...</div>
I started this way:
import urllib.request
from bs4 import BeautifulSoup
url = "http://......."
req = urllib.request.Request(url)
source = urllib.request.urlopen(req)
soup = BeautifulSoup(source, 'lxml')
start = soup.find('div', {'class': 'lead'})
end = soup.find('div', {'class': 'image'})
But I have no idea what to do next.
Upvotes: 8
Views: 2938
Reputation: 434
Try this code. It starts at the div with class lead, walks through its following siblings, and stops when it reaches the div with class image. It prints all the tags in between; this can be changed to print the entire markup:
html = ""
end = soup.find("div", {"class": "image"})
for tag in soup.find("div", {"class": "lead"}).next_siblings:
    # Stop at the end marker (identity check, not content equality)
    if tag is end:
        break
    html += str(tag)
print(html)
Upvotes: 1
Reputation: 403
Try using the code below:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""
<html><div class="lead">lead</div>data<div class="end"></div></html>
""", "lxml")
node = soup.find('div', {'class': 'lead'})
s = []
while node is not None:
    node = node.next_sibling
    # Stop at the first sibling tag that carries the "end" class;
    # .get() avoids a KeyError on tags without a class attribute
    if hasattr(node, "attrs") and "end" in node.attrs.get('class', []):
        break
    if node is not None:
        s.append(node)
print(s)
This uses next_sibling to step from each node to the next sibling node.
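Since the question asks for the text rather than the nodes, here is a minimal sketch of the same sibling walk that collects only text. The helper name text_between and the sample markup are made up for illustration, and "html.parser" is used only so the snippet runs without lxml. It assumes both marker divs are siblings under the same parent; if the end marker is missing, it collects everything up to the last sibling.

```python
from bs4 import BeautifulSoup, Tag

def text_between(soup, start_class, end_class):
    """Return the text of every sibling between two marker divs (hypothetical helper)."""
    start = soup.find("div", class_=start_class)
    if start is None:
        return ""
    # Look for the end marker only among the siblings that follow the start marker
    end = start.find_next_sibling("div", class_=end_class)
    parts = []
    for node in start.next_siblings:
        if node is end:
            break
        if isinstance(node, Tag):
            # Tags: gather all nested strings, stripped and joined by a space
            text = node.get_text(" ", strip=True)
        else:
            # Bare strings sitting between tags (often just whitespace)
            text = str(node).strip()
        if text:
            parts.append(text)
    return "\n".join(parts)

# Illustrative markup modelled on the question; not the asker's real page
sample = """
<div class="lead">I DONT WANT this</div>
<h2>Title</h2>
<p>First paragraph.</p>
<table><tr><td>cell</td></tr></table>
<div class="image">...</div>
<p>after</p>
"""
soup = BeautifulSoup(sample, "html.parser")
result = text_between(soup, "lead", "image")
print(result)
```

Comparing with `is` rather than `==` matters here because Beautiful Soup treats two tags with identical markup as equal, so an equality check could stop at a look-alike node.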
Upvotes: 0