Reputation: 421
I'm trying to scrape a site in Python with BS4; the markup follows this pattern:
<p>
Text 1
<br/>
Text 2
<br/>
Text 3
<br/>
Text 4
</p>
The code I wrote to do this skips "Text 1" and "Text 4":
for br in scraper.findAll('br'):
    next_s = br.nextSibling
    if not (next_s and isinstance(next_s, NavigableString)):
        continue
    next2_s = next_s.nextSibling
    if next2_s and isinstance(next2_s, Tag) and next2_s.name == 'br':
        text = str(next_s).strip()
        if text:
            wanted_text = next_s.split('Text ')[1]
I understand that it misses the first and last text in the p tag because of my second if statement, so I'm trying to figure out whether there's a different way to parse this.
Once I'm able to grab each "Text 1" string, I run a regex over each one to extract what I actually need, so the desired output from this code would be next_s = "Text 1".
Upvotes: 1
Views: 40
Reputation: 195623
For this kind of task you can use .get_text() with the separator= parameter, then split on that separator:
from bs4 import BeautifulSoup
html_doc = """
<p>
Text 1
<br/>
Text 2
<br/>
Text 3
<br/>
Text 4
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
texts = soup.find("p").get_text(strip=True, separator="|").split("|") # use separator not included in the text
print(texts)
Prints:
['Text 1', 'Text 2', 'Text 3', 'Text 4']
To get only first text:
print(texts[0])
Prints:
Text 1
Or use .find_all() with text=True (newer versions of Beautiful Soup name this argument string=):
texts = [t.strip() for t in soup.find("p").find_all(text=True, recursive=False)]
print(texts)
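If picking a separator that never appears in the text is a concern, a third option might be .stripped_strings, which yields each non-empty text node under the tag with whitespace already stripped. The sketch below reuses the same sample markup; the regex at the end is only a hypothetical stand-in for the follow-up parsing mentioned in the question, since the real pattern isn't shown there.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<p>
Text 1
<br/>
Text 2
<br/>
Text 3
<br/>
Text 4
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# .stripped_strings walks every text node under <p>, strips it,
# and skips the ones that are empty after stripping
texts = list(soup.find("p").stripped_strings)
print(texts)

# Hypothetical follow-up: pull the number out of each "Text N" string
pattern = re.compile(r"Text (\d+)")
numbers = [m.group(1) for m in map(pattern.search, texts) if m]
print(numbers)
```

Note that .stripped_strings is recursive, so text inside nested tags would be included as well; for a flat p like this one it should give the same list as the two approaches above.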
Upvotes: 4