Piyush Ghasiya

Reputation: 515

Can't fetch the content of articles using beautifulsoup in python 3.7

I am web-scraping with BeautifulSoup in Python 3.7. The code below successfully scrapes the date, title, and tags, but not the content of the articles: it returns None instead.

import time
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag
url = 'https://www.thehindu.com/search/?q=cybersecurity&order=DESC&sort=publishdate&ct=text&page={}'
pages = 32
for page in range(4, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.find_all("a", {"class": "story-card75x1-text"}, href=True):
        _href = item.get("href")
        try:
            resp = requests.get(_href)
        except Exception as e:
            try:
                resp = requests.get("https://www.thehindu.com"+_href)
            except Exception as e:
                continue

        dateTag = soup.find("span", {"class": "dateline"})
        sauce = BeautifulSoup(resp.text,"lxml")
        tag = sauce.find("a", {"class": "section-name"})
        titleTag = sauce.find("h1", {"class": "title"})
        contentTag = sauce.find("div", {"class": "_yeti_done"})

        date = None
        tagName = None
        title = None
        content = None

        if isinstance(dateTag,Tag):
            date = dateTag.get_text().strip()

        if isinstance(tag,Tag):
            tagName = tag.get_text().strip()

        if isinstance(titleTag,Tag):
            title = titleTag.get_text().strip()

        if isinstance(contentTag,Tag):
            content = contentTag.get_text().strip()

        print(f'{date}\n {tagName}\n {title}\n {content}\n')

        time.sleep(3)

I don't see where the problem is, as I am using the correct class in contentTag.

Thanks.

Upvotes: 0

Views: 136

Answers (1)

SIM

Reputation: 22440

I guess the links you would like to follow from the first page to their inner pages end with .ece. I've applied that logic within the script to traverse those target pages and scrape data from them. I've also defined the selector for the content slightly differently; now it appears to work correctly. The following script only scrapes data from page 1, so feel free to change it to suit your requirements.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.thehindu.com/search/?q=cybersecurity&order=DESC&sort=publishdate&ct=text&page=1'
base = "https://www.thehindu.com"

res = requests.get(url)
soup = BeautifulSoup(res.text, "lxml")
# Follow only the links that point at article pages, which end with ".ece"
for item in soup.select(".story-card-news a[href$='.ece']"):
    resp = requests.get(urljoin(base, item.get("href")))
    sauce = BeautifulSoup(resp.text, "lxml")
    title = item.get_text(strip=True)
    # The article body sits in a container whose id starts with "content-body-"
    content = ' '.join(p.get_text(strip=True) for p in sauce.select("[id^='content-body-'] p"))
    print(f'{title}\n {content}\n')
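Note that the `urljoin` call is what replaces the nested `try`/`except` in your original script: it resolves relative hrefs against the base and leaves already-absolute URLs untouched, so you never need to retry with the prefix bolted on. A quick standalone check (the article path here is made up for illustration):

```python
from urllib.parse import urljoin

base = "https://www.thehindu.com"

# A site-relative href gets the scheme and host prepended
print(urljoin(base, "/news/national/sample-article.ece"))
# -> https://www.thehindu.com/news/national/sample-article.ece

# An already-absolute href passes through unchanged
print(urljoin(base, "https://www.thehindu.com/news/national/sample-article.ece"))
# -> https://www.thehindu.com/news/national/sample-article.ece
```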

Upvotes: 1
