BS4 Grabbing Text in Between <p> Tags that Follow Pattern

Question

I'm trying to scrape a site in Python using BS4. The page follows this pattern:


<p>
Text 1
<br>
<br>
Text 2
<br>
<br>
Text 3
<br>
<br>
Text 4
</p>

The code I wrote to do this skips "Text 1" and "Text 4":

    for br in scraper.findAll('br'):
        next_s = br.nextSibling
        if not (next_s and isinstance(next_s, NavigableString)):
            continue
        next2_s = next_s.nextSibling
        if next2_s and isinstance(next2_s, Tag) and next2_s.name == 'br':
            text = str(next_s).strip()
            if text:
                wanted_text = next_s.split('Text ')[1]

I understand that the reason it doesn't grab the first and last text in the p tag is my second if statement, so I'm trying to figure out whether there's a different way to parse this.

Once I'm able to grab each "Text N" string, I use regex to parse each one and extract what I actually need, so the desired output from this code would be next_s = "Text 1".

Andrej Kesely · Accepted Answer

For this kind of task you can use .get_text() with the separator= parameter, then split on that separator:

from bs4 import BeautifulSoup
    
html_doc = """
<p>
Text 1
<br>
<br>
Text 2
<br>
<br>
Text 3
<br>
<br>
Text 4
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

texts = soup.find("p").get_text(strip=True, separator="|").split("|")  # use separator not included in the text
print(texts)

Prints:

['Text 1', 'Text 2', 'Text 3', 'Text 4']

To get only the first text:

print(texts[0])

Prints:

Text 1

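Or use .find_all() with string=True (spelled text=True in older Beautiful Soup releases), keeping only the non-empty direct children of the <p> tag:

texts = [t.strip() for t in soup.find("p").find_all(string=True, recursive=False) if t.strip()]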
print(texts)
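
A third option, shown here only as a minimal sketch, is the .stripped_strings generator, which yields every text node with surrounding whitespace removed and skips the ones that are empty after stripping. The HTML below assumes the pattern from the question (text separated by pairs of <br> tags), and the regex is a hypothetical stand-in for whatever follow-up parsing you actually need:

```python
import re

from bs4 import BeautifulSoup

# Assumed structure: text nodes inside one <p>, separated by pairs of <br> tags.
html_doc = """
<p>
Text 1
<br>
<br>
Text 2
<br>
<br>
Text 3
<br>
<br>
Text 4
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# stripped_strings walks all text nodes under <p> (including nested tags),
# strips whitespace, and drops strings that end up empty.
texts = list(soup.find("p").stripped_strings)
print(texts)

# Hypothetical follow-up step: pull the number out of each "Text N" string.
pattern = re.compile(r"Text (\d+)")
numbers = []
for t in texts:
    match = pattern.search(t)
    if match:
        numbers.append(match.group(1))
print(numbers)
```

Unlike find_all(..., recursive=False), .stripped_strings also descends into any tags nested inside the <p>, so it fits best when the paragraph contains only text and <br> tags.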
