Reputation: 98
I'm using Python with BeautifulSoup to get the text between the two br tags. Below are the HTML and my closest attempt, which uses next_sibling:
<html>
<body>
</a><span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>
for span in soup.findAll("span", {"class" : "strong"}):
    print(span.next_sibling.next_sibling.text)
But this prints:
The Text I want to getText I dont want
What I want is the text after the first p but before the second. I can't figure out how to extract it when it has no enclosing tag of its own and the br's are the only reference points.
I need it to print:
The Text I want to get
Upvotes: 1
Views: 7375
Reputation: 474211
Since the HTML you've provided is broken, the behavior would differ depending on which parser BeautifulSoup uses. With the lxml parser, BeautifulSoup would convert the br tag into a self-closing one:
>>> soup = BeautifulSoup(data, 'lxml')
>>> print(soup)
<html>
<body>
<span class="strong">Title1</span>
<p>Text1</p>
<br/>The Text I want to get<br/>
<p>Text I dont want</p>
</body>
</html>
Note that you would need lxml to be installed. If that is okay for you, find the br and get its next sibling:
from bs4 import BeautifulSoup
data = """your HTML"""
soup = BeautifulSoup(data, 'lxml')
print(soup.br.next_sibling) # prints "The Text I want to get"
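If installing lxml is not an option, the same idea can be sketched with the html.parser backend that ships with Python. This is only a sketch: it assumes a reasonably recent bs4 that treats br as a void element, and the results list and the find_next("br") step are illustrative additions, not part of the snippet above. It starts from each span, as in the question, so that it would also work with several title/text groups.

```python
from bs4 import BeautifulSoup, NavigableString

# The (broken) HTML from the question, including the stray </a>.
data = """<html>
<body>
</a><span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>"""

# Assumption: html.parser is used instead of lxml, so nothing extra is installed.
soup = BeautifulSoup(data, "html.parser")

results = []
for span in soup.find_all("span", {"class": "strong"}):
    # First br that appears after this span in document order.
    br = span.find_next("br")
    if br is None:
        continue
    node = br.next_sibling
    # Keep the sibling only if it is a bare, non-empty text node.
    if isinstance(node, NavigableString) and node.strip():
        results.append(node.strip())

print(results)
```

Checking isinstance(node, NavigableString) guards against the case where the br is followed directly by another tag rather than by text.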
Upvotes: 4
Reputation: 391
Using Scrapy, you can select the text nodes that sit directly under body with XPath:
In [4]: hxs.select('//body/text()').extract()
Out[4]: [u'\n', u'\n', u'\n', u'The Text I want to get', u'\n', u'\n']
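The same selection (text nodes that are direct children of body) can be approximated without Scrapy, using only the standard library's html.parser. This is a rough sketch, not the Scrapy API: BodyTextCollector and VOID_TAGS are hypothetical names made up for illustration, the void-tag set is deliberately incomplete, and whitespace-only nodes are dropped rather than returned as '\n'.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of tags that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class BodyTextCollector(HTMLParser):
    """Collects non-blank text whose innermost open element is <body>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open elements
        self.texts = []      # stripped text found directly under <body>

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray closing tags such as the </a> in the question's HTML.
        if tag in self.open_tags:
            # Pop up to and including the matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if self.open_tags and self.open_tags[-1] == "body" and data.strip():
            self.texts.append(data.strip())


data = """<html>
<body>
</a><span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>"""

collector = BodyTextCollector()
collector.feed(data)
collector.close()
texts = collector.texts
print(texts)
```

Because br is treated as a void tag here, the text after it still counts as a direct child of body, which is what makes it selectable.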
Upvotes: 0