Extracting text between <br> with beautifulsoup, but without next tag

Question

I'm using python + beautifulsoup to try to get the text between the br's. The closest I got to this was by using next_sibling in the following manner:



Title1
Text1

The Text I want to get

Text I dont want



for span in soup.findAll("span", {"class" : "strong"}):
    print(span.next_sibling.next_sibling.text)

But this prints:

The Text I want to getText I dont want

So what I want is the text after the first p but before the second. I can't figure out how to extract it when there are no real tags around it, only the br's as references.

I need it to print:

The Text I want to get

alecxe · Accepted Answer

Since the HTML you've provided is broken, the result depends on which parser BeautifulSoup uses.
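One way to see the difference is to run the same snippet through each parser and compare what follows the br. The sketch below is only illustrative: the markup is a hypothetical stand-in for the question's HTML, `br_sibling_by_parser` is a made-up helper name, and parsers that are not installed are reported rather than assumed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical broken markup: <br> is a void element, so </br> is invalid.
data = "<p>Text1</p><br>The Text I want to get</br><p>Text I dont want</p>"


def br_sibling_by_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the repr of the node following the first <br>."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Raised by bs4 when the requested parser library is missing.
            results[name] = "not installed"
            continue
        br = soup.find("br")
        results[name] = repr(br.next_sibling) if br is not None else "no <br> found"
    return results


for name, sibling in br_sibling_by_parser(data).items():
    print(f"{name:12} -> {sibling}")
```

The output depends on the installed bs4 and parser versions, which is the point: check it against your own environment before relying on a particular sibling structure.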

In case of lxml parser, BeautifulSoup would convert the br tag into a self-closing one:

>>> soup = BeautifulSoup(data, 'lxml')
>>> print(soup)


Title1
Text1

The Text I want to get

Text I dont want

Note that you need lxml installed for this. If that is acceptable, find the br and get its next sibling:

from bs4 import BeautifulSoup

data = """your HTML"""
soup = BeautifulSoup(data, 'lxml')

print(soup.br.next_sibling)  # prints "The Text I want to get"
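To tie this back to the loop in the question, here is a minimal sketch that starts from each span and takes the text node right after the next br. The markup is a hypothetical, well-formed stand-in (the exact original HTML is not assumed), `text_after_br` is an illustrative helper name, and `html.parser` is used only so the example runs without extra installs; swap in 'lxml' to match the answer above.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, well-formed stand-in for the question's markup.
data = """
<div>
<span class="strong">Title1</span>
<p>Text1</p>
<br>The Text I want to get<br>
<p>Text I dont want</p>
</div>
"""


def text_after_br(br):
    """Return the stripped text node right after a <br>, or None if there is none."""
    sibling = br.next_sibling if br is not None else None
    if isinstance(sibling, NavigableString):
        # Whitespace-only nodes (e.g. a bare newline) count as "no text".
        return sibling.strip() or None
    return None


soup = BeautifulSoup(data, "html.parser")

results = []
for span in soup.find_all("span", class_="strong"):
    br = span.find_next("br")  # first <br> after this span in document order
    results.append(text_after_br(br))

print(results)
```

Checking that the sibling is a NavigableString guards against the case where the br is immediately followed by another tag, in which case the helper returns None instead of that tag's text.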
