Reputation: 29
I am very new to BeautifulSoup.
How can I extract the text of a paragraph from HTML source, split that text wherever there is a <br/>, and store the pieces in an array so that each element is one chunk of the paragraph text (as delimited by the <br/> tags)?
For example, for the following paragraph:
<p>
<strong>Pancakes</strong>
<br/>
A <strong>delicious</strong> type of food
<br/>
</p>
I would like it to be stored into the following array:
['Pancakes', 'A delicious type of food']
What I have tried is:
import bs4 as bs
soup = bs.BeautifulSoup("<p>Pancakes<br/> A delicious type of food<br/></p>")
p = soup.findAll('p')
p[0] = p[0].getText()
print(p)
but this outputs an array with only one element:
['Pancakes A delicious type of food']
What is a way to code it so that I can get an array that contains the paragraph text split by any <br/> in the paragraph?
Upvotes: 1
Views: 2412
Reputation: 11
I came across this while having a similar issue. A simple fix is to replace the line
p[0] = p[0].getText()
with
p[0].getText('#').split('#')
The result is: ['Pancakes', ' A delicious type of food']
Obviously, choose a separator character (or string) that won't appear in the text.
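One caveat with the separator approach: get_text() places the separator between every text node, so nested tags such as <strong> also produce splits. A minimal sketch of a variation, assuming it is acceptable to modify the parsed tree: replace each <br/> with a marker string first, then split on that marker. The SENTINEL value is a hypothetical marker chosen for illustration, not something BeautifulSoup provides.

```python
from bs4 import BeautifulSoup

html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
SENTINEL = "||BR||"  # hypothetical marker; pick any string that cannot occur in the text

# Plain separator: every text node boundary becomes a split point
naive = BeautifulSoup(html, "html.parser").p.get_text("#").split("#")
print(naive)  # ['Pancakes', ' A ', 'delicious', ' type of food']

# Variation: turn only the <br/> tags into the marker, then split on it
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")
for br in p.find_all("br"):
    br.replace_with(SENTINEL)
parts = [chunk.strip() for chunk in p.get_text().split(SENTINEL) if chunk.strip()]
print(parts)  # ['Pancakes', 'A delicious type of food']
```

Note that replace_with() changes the soup in place, so parse the HTML again if the original <br/> tags are needed afterwards.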
Upvotes: 0
Reputation: 13057
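Try this:
from bs4 import BeautifulSoup, NavigableString
html = '<p>Pancakes<br/> A delicious type of food<br/></p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.find_all('p')
# Keep only the direct text children of the first <p>; the <br/> tags are skipped
result = [str(child).strip() for child in p[0].children
          if isinstance(child, NavigableString)]
print(result)  # ['Pancakes', 'A delicious type of food']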
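Update for nested tags (recursive search):
from bs4 import BeautifulSoup
html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, 'html.parser')
# Collect every text node under <p>, at any depth (recursive=True is the default).
# This splits at every tag boundary, not only at <br/>.
p = soup.find('p').find_all(text=True, recursive=True)
print(p)  # ['Pancakes', ' A ', 'delicious', ' type of food']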
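Update again, for text split only by <br>:
from bs4 import BeautifulSoup, NavigableString, Tag
html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, 'html.parser')
text = ''
for child in soup.find_all('p')[0]:
    if isinstance(child, NavigableString):
        # Keep surrounding spaces so words next to a nested tag do not run together
        text += str(child)
    elif isinstance(child, Tag):
        if child.name != 'br':
            text += child.text
        else:
            # Mark each <br> with a newline to split on afterwards
            text += '\n'
# Strip each chunk once it is complete and drop the empty ones
result = [chunk.strip() for chunk in text.split('\n') if chunk.strip()]
print(result)  # ['Pancakes', 'A delicious type of food']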
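The newline-marker loop above assumes the paragraph text itself contains no newlines, which is not true of the multi-line HTML in the question. A minimal sketch that avoids a marker character altogether by collecting chunks in a list while walking the tree; the helper name split_on_br is hypothetical, and collapsing runs of whitespace inside each chunk is an assumption about the desired output.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

def split_on_br(tag):
    """Return the text under `tag` as a list of chunks, split at each <br>."""
    chunks, current = [], []
    # .descendants visits nested nodes too, so a <br> at any depth is seen
    for node in tag.descendants:
        if isinstance(node, Tag) and node.name == "br":
            chunks.append("".join(current))
            current = []
        elif isinstance(node, NavigableString):
            current.append(str(node))
    chunks.append("".join(current))
    # Collapse whitespace (including newlines) within each chunk, drop empty chunks
    cleaned = (" ".join(chunk.split()) for chunk in chunks)
    return [chunk for chunk in cleaned if chunk]

html = """<p>
<strong>Pancakes</strong>
<br/>
A <strong>delicious</strong> type of food
<br/>
</p>"""
soup = BeautifulSoup(html, "html.parser")
result = split_on_br(soup.find("p"))
print(result)  # ['Pancakes', 'A delicious type of food']
```

Because the chunks are never joined into one string, no character in the text can be mistaken for a split point.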
Upvotes: 3