Bill Orton

Reputation: 135

Extracting an article's text using BeautifulSoup

I am trying to extract all of the text from an article using BeautifulSoup. I can separate the article's text from the preceding and following HTML, but I cannot figure out how to separate the text from all of its embedded HTML markup. Here is my code:

from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/tata-consultancy-services-reports-broad-based-growth-across-markets-marks-steady-fy17-300440934.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('p', {'itemprop': 'articleBody'})

links contains all of the article text, but it is broken into several segments, one per matching p tag, and each segment still has HTML mixed in.

Any ideas on how to strip the interspersed HTML from these segments and combine them into a single block of article text would be greatly appreciated.

Upvotes: 0

Views: 3770

Answers (1)

odradek

Reputation: 1001

You can use the get_text() method, which returns all of the text beneath a tag:

links = [e.get_text() for e in soup.find_all('p', {'itemprop': 'articleBody'})]

Then join the segments however you want:

article = '\n'.join(links)
print(len(article))

For the page in the question, this would output:

$ 6485
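
To try the same approach without hitting the network, here is a minimal sketch. The HTML string is a made-up stand-in for the real page, the built-in 'html.parser' is used instead of 'lxml' only to avoid an extra dependency, and the whitespace normalization step is an optional addition, not something the original code does:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
html = """
<div>
  <p itemprop="articleBody">First <b>bold</b> paragraph with a <a href="#">link</a>.</p>
  <p itemprop="articleBody">  Second paragraph.  </p>
  <p>Not part of the article.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select only the paragraphs marked as article body, as in the question.
paragraphs = soup.find_all("p", {"itemprop": "articleBody"})

# get_text() drops the nested tags (<b>, <a>, ...) and keeps their text.
# split()/join collapses runs of whitespace and trims the ends (optional).
segments = [" ".join(p.get_text().split()) for p in paragraphs]

# Combine the segments into one string, one paragraph per line.
article = "\n".join(segments)
print(article)
```

The unmarked third paragraph is skipped because it lacks the itemprop attribute, and the inline tags inside the first paragraph disappear while their text stays in place.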

Upvotes: 1
