Reputation: 135
I am trying to extract all of the text from an article using BeautifulSoup. I can separate the article's text from the preceding and following HTML, but I cannot figure out how to separate the text from all of its embedded HTML code. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/tata-consultancy-services-reports-broad-based-growth-across-markets-marks-steady-fy17-300440934.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('p', {'itemprop': 'articleBody'})
links contains all of the article text, but it is broken into several segments (one <p> element per match), and each segment still has HTML tags mixed into it.
Any ideas on how to strip the interspersed HTML from these segments and combine them into a single block of article text would be greatly appreciated.
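For illustration, here is a minimal sketch of one possible way to do this, not necessarily the only or best one: call get_text() on each matched element to drop the nested tags, then join the results. The SAMPLE_HTML string and the extract_article_text helper are hypothetical stand-ins so the example runs without a network request, and it uses the built-in 'html.parser' rather than 'lxml' only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.text; the real page markup may differ.
SAMPLE_HTML = """
<html><body>
<p itemprop="articleBody">First <b>bold</b> segment with a <a href="#">link</a>.</p>
<div>unrelated markup</div>
<p itemprop="articleBody">Second   segment.</p>
</body></html>
"""


def extract_article_text(html, parser="html.parser"):
    """Return the text of all articleBody paragraphs, one per line."""
    soup = BeautifulSoup(html, parser)
    segments = soup.find_all("p", {"itemprop": "articleBody"})
    cleaned = []
    for segment in segments:
        # get_text() concatenates the text nodes and discards nested tags.
        text = segment.get_text()
        # Collapse runs of whitespace left over from the markup.
        cleaned.append(" ".join(text.split()))
    return "\n".join(cleaned)


article_text = extract_article_text(SAMPLE_HTML)
print(article_text)
```

With the code from the question, the same idea would be to pass html (and 'lxml', if installed) to the helper instead of SAMPLE_HTML.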
Upvotes: 0
Views: 3770