Extracting an article's text using BeautifulSoup

Question

I am trying to extract all of the text from an article using BeautifulSoup. I can separate the article's text from the preceding and following HTML, but I cannot figure out how to separate the text from all of its embedded HTML code. Here is my code:

from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/tata-consultancy-services-reports-broad-based-growth-across-markets-marks-steady-fy17-300440934.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('p', {'itemprop': 'articleBody'})

links contains all of the article text, but it is broken into several segments (one tag per paragraph), each still carrying its embedded HTML.

Any ideas on how to separate and combine all of the article text segments from the HTML interspersed within the article's text would be greatly appreciated.

odradek · Accepted Answer

You can use the get_text method, which returns all of the text beneath a tag:

links = [e.get_text() for e in soup.find_all('p', {'itemprop': 'articleBody'})]

Then join the pieces however you want:

article = '\n'.join(links)
print(len(article))

would output

$ 6485
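
For a version that can be tried without fetching the page, here is a minimal sketch of the same approach on a small inline HTML string. The sample markup, the function name extract_article_text, and the whitespace normalization step are assumptions added for illustration; they are not part of the original answer. It uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<html><body>
<p itemprop="articleBody">First <b>bold</b> paragraph with a <a href="#">link</a>.</p>
<p>Unrelated paragraph.</p>
<p itemprop="articleBody">Second paragraph,
   spread over <i>two</i> lines.</p>
</body></html>
"""


def extract_article_text(html):
    """Return (segments, article) for all <p itemprop="articleBody"> tags."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all("p", {"itemprop": "articleBody"})
    # get_text() drops the nested tags and keeps only their text;
    # split()/join() then collapses runs of whitespace and newlines.
    segments = [" ".join(p.get_text().split()) for p in paragraphs]
    # One paragraph per line in the combined article text.
    return segments, "\n".join(segments)


segments, article = extract_article_text(SAMPLE_HTML)
print(article)
```

Collapsing whitespace is optional; it only matters when the source HTML wraps a paragraph across several lines, as the second sample paragraph does.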
