Gokul

Reputation: 237

Scraping a webpage using BeautifulSoup4

I am trying to print the content of a news article using BeautifulSoup4.

The URL is: Link

The current code I have is as follows, and it gives the desired output:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.thehindu.com/news/national/People-showing-monumental-patience-queuing-up-for-a-better-India-says-Venkaiah/article16447029.ece')
soup = BeautifulSoup(page.content, 'html.parser')

article_text = ""
table = soup.find_all("div", {"id": "content-body-14266949-16447029"})

for element in table:
    article_text += ''.join(element.find_all(text=True)) + "\n\n"

print(article_text)

However, the problem is that I want to scrape multiple pages, and each of them has a different content-body number in the format xxxxxxxx-xxxxxxxx (two blocks of 8 digits).

I tried replacing the soup.find_all call with a regex:

table = soup.find_all(text=re.compile("content-body-........-........"))

but this gives an error:

AttributeError: 'NavigableString' object has no attribute 'find_all'

Can anybody guide me on what needs to be done?

Thank you.

Upvotes: 1

Views: 159

Answers (3)

SIM

Reputation: 22440

Another approach may be to use a CSS selector. Selectors are clean and to the point, so you might give it a try as well. Just replace "url" with the link you are interested in.

import requests
from bs4 import BeautifulSoup

res = requests.get(url).text  # "url" is the article link you want to scrape
soup = BeautifulSoup(res, "html.parser")

# Select every <p> inside any div whose id starts with "content-body-"
for item in soup.select("div[id^=content-body-] p"):
    print(item.text)
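
If you want the paragraphs back as a single string, like article_text in the question, here is a small sketch along the same lines; it reuses the question's URL, and everything beyond the selector itself is an assumption about how you want the output joined:

import requests
from bs4 import BeautifulSoup

url = 'http://www.thehindu.com/news/national/People-showing-monumental-patience-queuing-up-for-a-better-India-says-Venkaiah/article16447029.ece'
res = requests.get(url).text
soup = BeautifulSoup(res, "html.parser")

# Join the paragraph texts instead of printing them one by one.
article_text = "\n\n".join(item.text for item in soup.select("div[id^=content-body-] p"))
print(article_text)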

Upvotes: 1

Robbie Jones

Reputation: 391

Regular expressions should be fine! Try

table = soup.find_all("div",{ "id": re.compile('content-body-*')})

Upvotes: 2

Sun Yi

Reputation: 39

You can extract the content using lxml. The lxml library lets you use XPath to extract content from HTML:

from lxml import etree

# pageText is the HTML source of the page, e.g. requests.get(url).text
selector = etree.HTML(pageText)
article_text = selector.xpath('//div[@class="article-block-multiple live-snippet"]/div[1]')[0].text

I don't use BeautifulSoup, but I think you can do something similar with it:

table = soup.find_all("div", {"class": "article-block-multiple live-snippet"})

and then find the first child div element.
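
A self-contained sketch of the lxml route, assuming the article body sits in a div with exactly the class string used above (the class value and the page structure are this answer's assumptions). It joins every text node under that div with //text(), since .text alone only returns the text before the first child tag:

import requests
from lxml import etree

url = 'http://www.thehindu.com/news/national/People-showing-monumental-patience-queuing-up-for-a-better-India-says-Venkaiah/article16447029.ece'
pageText = requests.get(url).text

# Parse the HTML and gather all text nodes under the assumed article container.
selector = etree.HTML(pageText)
nodes = selector.xpath('//div[@class="article-block-multiple live-snippet"]//text()')
article_text = ''.join(nodes)

print(article_text)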

Upvotes: 2
