Reputation: 2232
I currently use the following Python code excerpt to get all text elements of a webpage:
from bs4 import BeautifulSoup

def scraping(url, html):
    data = {}
    soup = BeautifulSoup(html, "lxml")
    data["news"] = []
    # All <p> tags inside the news container
    page = soup.find("div", {"class": "container_news"}).findAll('p')
    page_text = ''
    for p in page:
        page_text += ''.join(p.findAll(text=True))
    data["news"].append(page_text)
    print(page_text)
    return data
However, the output of page_text looks like:
"['New news on the internet. ', 'Here is some text. ', ""Here is some other."", ""And then there are other variations \n\nLooks like there are some non-text elements. \n\xa0""]" ...
Is it possible to clean up the content and merge the lists into one string? BeautifulSoup solutions would be preferred over regex variants.
Thank you!
Upvotes: 0
Views: 537
Reputation: 81614
I'm not sure of the significance of maintaining data["news"], but this can be done in a single line:
page_text = ' '.join(e.text for p in page for e in p.findAll(text=True))
Instead of ' ' you can use whatever string you want as the delimiter.
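If the stray newlines and non-breaking spaces from the question's output (\n and \xa0) should go away as well, one option is to collapse whitespace after joining. This is only a minimal sketch in plain Python: merge_fragments is a made-up helper name, and the sample fragments are adapted from the output shown in the question.

```python
def merge_fragments(fragments):
    """Join text fragments into one string and collapse whitespace.

    Hypothetical helper for illustration, not part of BeautifulSoup.
    """
    merged = " ".join(fragments)
    # str.split() without arguments splits on any Unicode whitespace,
    # which includes "\n" and the non-breaking space "\xa0".
    return " ".join(merged.split())


# Sample fragments adapted from the question's output
fragments = [
    "New news on the internet. ",
    "Here is some text. ",
    "Looks like there are some non-text elements. \n\xa0",
]
print(merge_fragments(fragments))
```

The same split/join step can be applied to the result of the one-liner above, so no regex is needed.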
Otherwise:
page_text = []
for p in page:
    page_text.extend(e.text for e in p.findAll(text=True))
data["news"].append(page_text)
print(' '.join(page_text))
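As a possible BeautifulSoup-only variation, the stripped_strings generator yields each text node with leading and trailing whitespace removed and skips nodes that are whitespace only. The sketch below rests on assumptions: the sample markup and the function name extract_news_text are invented for illustration, and "html.parser" is used only so the example runs without lxml installed. Only the container_news class comes from the question.

```python
from bs4 import BeautifulSoup

# Invented sample markup; only the "container_news" class is from the question.
SAMPLE_HTML = """
<div class="container_news">
  <p>New news on the internet. </p>
  <p>Here is <b>some</b> text.

&nbsp;</p>
</div>
"""


def extract_news_text(html):
    # "html.parser" ships with Python; the question uses "lxml" instead.
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"class": "container_news"})
    if container is None:
        return ""  # no news container on this page
    fragments = []
    for p in container.find_all("p"):
        # stripped_strings drops surrounding whitespace (including \xa0)
        # and skips text nodes that contain nothing else.
        fragments.extend(p.stripped_strings)
    # Collapse any whitespace that remains inside a fragment.
    return " ".join(" ".join(fragments).split())


news_text = extract_news_text(SAMPLE_HTML)
print(news_text)
```

One caveat with this approach: because every text node is joined with a space, punctuation that directly follows an inline tag (for example `<b>some</b>.`) would end up separated from the word before it.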
Upvotes: 4