UdaraW

Reputation: 98

Extracting text between <br> tags with BeautifulSoup, without the text of the next tag

I'm using Python and BeautifulSoup to try to get the text between the <br> tags. The closest I got was using next_sibling, with the following HTML and code:

<html>
<body>
</a><span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>

for span in soup.findAll("span", {"class" : "strong"}):
    print(span.next_sibling.next_sibling.text)

But this prints:

The Text I want to getText I dont want

So the text I want comes after the first <p> and before the second, but I can't figure out how to extract it when it isn't wrapped in a real tag and only the <br> tags serve as reference points.

I need it to print:

The Text I want to get

Upvotes: 1

Views: 7375

Answers (2)

alecxe

Reputation: 474211

Since the HTML you've provided is broken, the behavior will differ depending on which parser BeautifulSoup uses.

With the lxml parser, BeautifulSoup converts each br tag into a self-closing one:

>>> soup = BeautifulSoup(data, 'lxml')
>>> print(soup)
<html>
<body>
<span class="strong">Title1</span>
<p>Text1</p>
<br/>The Text I want to get<br/>
<p>Text I dont want</p>
</body>
</html>

Note that you would need lxml to be installed. If that is okay for you, find the first br and get its next sibling:

from bs4 import BeautifulSoup

data = """your HTML"""
soup = BeautifulSoup(data, 'lxml')

print(soup.br.next_sibling)  # prints "The Text I want to get"
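
To tie this back to the loop over span elements in the question, here is a minimal sketch that starts from each span, finds the next br, and keeps its next sibling only when that sibling is a bare text node. It assumes a recent bs4 release whose built-in html.parser backend treats br as a void element (pass 'lxml' instead if yours does not); text_after_first_br is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString

# The HTML from the question, including the stray </a>.
html = """<html>
<body>
</a><span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</body>
</html>"""


def text_after_first_br(span):
    """Return the bare text right after the first <br> following `span`, or None."""
    br = span.find_next("br")
    if br is None:
        return None
    node = br.next_sibling
    # Only accept a plain text node; skip the case where a tag follows the <br>.
    if isinstance(node, NavigableString):
        return str(node).strip()
    return None


# Assumption: html.parser handles <br> as a void element in this bs4 version.
soup = BeautifulSoup(html, "html.parser")
results = [text_after_first_br(s) for s in soup.find_all("span", class_="strong")]
print(results)
```

Checking the node type before using it avoids silently picking up the text of a following tag, which is what happened with the chained next_sibling calls in the question.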

Upvotes: 4

Anandhakumar R

Reputation: 391

Using Scrapy, you can select the text nodes that are direct children of body:

In [4]: hxs.select('//body/text()').extract()
Out[4]: [u'\n', u'\n', u'\n', u'The Text I want to get', u'\n', u'\n']

Upvotes: 0