SD_23

Reputation: 421

BS4 Grabbing Text in Between <p> Tags that Follow Pattern

I'm trying to scrape a site in Python using BS4. The markup follows this pattern:

<p>
Text 1
<br/>
Text 2
<br/>
Text 3
<br/>
Text 4
</p>

The code I wrote to do this skips "Text 1" and "Text 4":

for br in scraper.findAll('br'):
    next_s = br.nextSibling
    if not (next_s and isinstance(next_s,NavigableString)):
        continue
    next2_s = next_s.nextSibling
    if next2_s and isinstance(next2_s,Tag) and next2_s.name == 'br':
        text = str(next_s).strip()
        if text:
            wanted_text = next_s.split('Text ')[1]

I understand that it misses the first and last text in the p tag because of my second if statement: "Text 1" does not follow a br, and "Text 4" is not followed by one. So I'm trying to figure out whether there's a different way to parse this.

Once I'm able to grab each "Text 1" string, I use regex on each one to extract what I actually need, so the desired output from this code would be next_s = "Text 1".

Upvotes: 1

Views: 40

Answers (1)

Andrej Kesely

Reputation: 195623

For this kind of task you can use .get_text() with the separator= parameter, then split on that separator:

from bs4 import BeautifulSoup
    
html_doc = """
<p>
Text 1
<br/>
Text 2
<br/>
Text 3
<br/>
Text 4
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

texts = soup.find("p").get_text(strip=True, separator="|").split("|")  # use a separator that does not occur in the text
print(texts)

Prints:

['Text 1', 'Text 2', 'Text 3', 'Text 4']

To get only first text:

print(texts[0])

Prints:

Text 1
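
If no safe separator character is available, a separator-free variant may be easier. The sketch below uses the .stripped_strings generator, which yields each text node with surrounding whitespace removed and skips whitespace-only nodes. Like .get_text(), it also descends into nested tags, so it assumes the <p> contains only text and <br/> elements, as in the question:

```python
from bs4 import BeautifulSoup

# Same sample markup as in the question
html_doc = "<p>\nText 1\n<br/>\nText 2\n<br/>\nText 3\n<br/>\nText 4\n</p>"

soup = BeautifulSoup(html_doc, "html.parser")

# Collect every non-empty, whitespace-stripped text node inside the <p>
texts = list(soup.find("p").stripped_strings)
print(texts)
```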

Or use .find_all() with text=True and recursive=False, which returns only the text nodes that are direct children of the <p>:

texts = [t.strip() for t in soup.find("p").find_all(text=True, recursive=False)]
print(texts)
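
Since the question mentions running a regex over each string afterwards, here is a minimal sketch of how the two steps might fit together. It uses string=True, the newer spelling of text=True (available since Beautiful Soup 4.4.0). The pattern Text\s+(\d+) is only a hypothetical stand-in for whatever the real data needs:

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<p>
Text 1
<br/>
Text 2
<br/>
Text 3
<br/>
Text 4
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Direct-child text nodes of the <p>; whitespace-only nodes are dropped
nodes = soup.find("p").find_all(string=True, recursive=False)
texts = [t.strip() for t in nodes if t.strip()]

# Hypothetical pattern: capture the number that follows "Text"
pattern = re.compile(r"Text\s+(\d+)")

numbers = []
for t in texts:
    match = pattern.search(t)
    if match:
        numbers.append(match.group(1))

print(texts)
print(numbers)
```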

Upvotes: 4
