Reputation: 237
I am trying to print the content of a news article using BeautifulSoup4.
The URL is the one passed to requests.get in the code below.
My current code, which gives the desired output, is as follows:
import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.thehindu.com/news/national/People-showing-monumental-patience-queuing-up-for-a-better-India-says-Venkaiah/article16447029.ece')
soup = BeautifulSoup(page.content, 'html.parser')
article_text = ""
# The id is hard-coded for this one article
table = soup.find_all("div", {"id": "content-body-14266949-16447029"})
for element in table:
    # Join every text node found inside the div
    article_text += ''.join(element.find_all(text=True)) + "\n\n"
print(article_text)
However, the problem is I want to scrape multiple pages and each of them has a different content body number in the format xxxxxxxx-xxxxxxxx (2 blocks of 8 digits.)
I tried replacing the soup.find_all command with regex as:
table = soup.find_all(text=re.compile("content-body-........-........"))
but this gives an error:
AttributeError: 'NavigableString' object has no attribute 'find_all'
Can anybody guide me to what needs to be done?
Thank you.
Upvotes: 1
Views: 159
Reputation: 22440
Another approach is to use a CSS selector. Selectors are clean and to the point, so you might give this a try as well. Just set url to the link you want to scrape.
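import requests
from bs4 import BeautifulSoup

# Replace with the article you want to scrape
url = 'http://www.thehindu.com/news/national/People-showing-monumental-patience-queuing-up-for-a-better-India-says-Venkaiah/article16447029.ece'
res = requests.get(url).text
soup = BeautifulSoup(res, "html.parser")
# id^= matches any div whose id starts with "content-body-"
for item in soup.select('div[id^="content-body-"] p'):
    print(item.text)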
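To see what the selector picks up without hitting the network, here is a minimal sketch on a made-up HTML snippet. The markup, the paragraph text, and the sidebar div are assumptions for illustration; only the id format comes from the question.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for an article page (hypothetical content).
html = """
<div id="content-body-14266949-16447029">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<div id="sidebar"><p>Unrelated text.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select <p> tags inside any div whose id starts with "content-body-".
paragraphs = [p.get_text() for p in soup.select('div[id^="content-body-"] p')]

# Join the paragraphs the same way the question does, with blank lines between.
article_text = "\n\n".join(paragraphs)
print(article_text)
```

Because the selector only checks the start of the id, the same line should work on other articles as long as their body div follows the same naming.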
Upvotes: 1
Reputation: 391
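Regular expressions should be fine! Pass the compiled pattern as the value of the id attribute rather than as text=, so that find_all returns the div tags instead of strings. Try
table = soup.find_all("div", {"id": re.compile(r'^content-body-\d{8}-\d{8}$')})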
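A quick way to check a pattern before handing it to find_all is to try it on a few id strings. The sketch below compares a loose pattern, content-body-* (where -* means "zero or more hyphens"), with one anchored to the two blocks of 8 digits described in the question. The second and third ids are made up for illustration. It uses search() because that is how BeautifulSoup applies a compiled pattern to an attribute value.

```python
import re

# Loose pattern: "-*" means "zero or more hyphens", so the digits are never checked.
loose = re.compile(r"content-body-*")
# Anchored pattern for the stated format: two blocks of 8 digits.
strict = re.compile(r"^content-body-\d{8}-\d{8}$")

ids = [
    "content-body-14266949-16447029",  # id from the question
    "content-body",                    # hypothetical id with no digits
    "related-content-body-links",      # hypothetical unrelated id
]

loose_hits = [i for i in ids if loose.search(i)]
strict_hits = [i for i in ids if strict.search(i)]

print(loose_hits)
print(strict_hits)
```

The loose pattern accepts all three ids, while the anchored one accepts only the first, so the anchored form is the safer choice if a page has other elements with similar ids.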
Upvotes: 2
Reputation: 39
You can also extract the content with lxml. The lxml library lets you use XPath to extract content from HTML:
from lxml import etree
# pageText holds the page's HTML source as a string
selector = etree.HTML(pageText)
article_text = selector.xpath('//div[@class="article-block-multiple live-snippet"]/div[1]')[0].text
I don't use BeautifulSoup myself, but I think you can do something similar with it:
table = soup.find_all("div", {"class": "article-block-multiple live-snippet"})
Then find the first child div element of each match.
Upvotes: 2