RSS feed news scraping with Python 3.7: I am not getting the exact information. Please help me get the proper data

Question

I am trying to get news from an RSS feed, but I am not getting the exact information I need. I am using requests and BeautifulSoup. Each item in the feed looks like the following object.


 
  US making very good headway in respect to Covid-19 vaccines: Donald Trump
 
 
  Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.
 
 
  https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms
 
 
  https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms
 
 
  Mon, 15 Jun 2020 22:11:06 PT

Here is the code I am using:

import requests
from bs4 import BeautifulSoup


def timesofindiaNews():
    URL = 'https://timesofindia.indiatimes.com/rssfeeds_us/72258322.cms'

    # Download the feed and parse it as XML
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, features='xml')

    # print(soup.prettify())

    # Every news entry is an <item> element
    news_elems = soup.find_all('item')
    news = []
    print(news_elems[0].prettify())
    for news_elem in news_elems:

        title = news_elem.title.text
        news_description = news_elem.description.text
        image = news_elem.description.img
        # news_date = news_elem.pubDate.text
        news_link = news_elem.link.text

I want the description from the <description> tag, but the <description> contains more details, like <a> and <img>, which are not required in the description. The above code gives the following output.

    {
      "image": null,
      "news_description": "Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.",
      "news_link": "https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms",
      "source": "trucknews",
      "title": "US making very good headway in respect to Covid-19 vaccines: Donald Trump"
    }

Expected output:

    {
      "image": "image/link/from/the/description",
      "news_description": "Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.",
      "news_link": "https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms",
      "source": "trucknews",
      "title": "US making very good headway in respect to Covid-19 vaccines: Donald Trump"
    }

Humayun Ahmad Rajib · Accepted Answer

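In the feed, the < and > inside <description> are escaped as &lt; and &gt;, so BeautifulSoup sees the embedded HTML as plain text rather than as tags. That is why I use prettify(formatter=None), which writes the text back out without escaping, and then parse the result a second time. See how news_description is obtained. I think this gives the result you want. You can try it: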

import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with the request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}


def timesofindiaNews():
    URL = 'https://timesofindia.indiatimes.com/rssfeeds_us/72258322.cms'

    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.text, 'xml')

    # print(soup.prettify())

    news_elems = soup.find_all('item')
    news = []
    # print(news_elems[0].prettify())
    for news_elem in news_elems:

        title = news_elem.title.text
        n_description = news_elem.description
        # Write <description> out without escaping, then parse it again
        # so the embedded <a> and <img> become real tags
        store = n_description.prettify(formatter=None)
        sp = BeautifulSoup(store, 'xml')
        # The description text is the node right after the <a> tag
        news_description = sp.find("a").next_sibling
        print(news_description)
        # Read the image from the re-parsed description; in the original
        # <description> the markup is only text, so .img there is None
        img = sp.find("img")
        image = img.get("src") if img else None
        # news_date = news_elem.pubDate.text
        news_link = news_elem.link.text


timesofindiaNews()

The output will be:

Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.

The proposed suspension could extend into the government's new fiscal year beginning October 1, when many new visas are issued, The Wall Street Journal reported on Thursday, quoting unnamed administration officials.

The team of researchers at the University of Georgia (UGA) in the US noted that the SARS-CoV-2 protein PLpro is essential for the replication and the ability of the virus to suppress host immune function.

After two weeks of protests over the death of George Floyd, hundreds of New Yorkers took to the streets again calling for reform in law enforcement and the withdrawal of police department funding.

Indian-origin California Senator Kamala Harris has joined former vice president and 2020 Democratic presidential nominee Joe Biden to raise USD 3.5 million for the upcoming November elections.


and so on....
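
The same idea can be tried offline with only the standard library. The sketch below is not the code from the answer: it swaps BeautifulSoup for xml.etree.ElementTree and html.parser, and it reads a small hard-coded feed instead of the live URL. SAMPLE_FEED, DescriptionParser, parse_items and the example.com URLs are all made up for illustration, and the sample assumes the description is stored as escaped HTML, as described above.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample shaped like one feed item; the URLs are placeholders.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <item>
      <title>US making very good headway in respect to Covid-19 vaccines: Donald Trump</title>
      <description>&lt;a href="https://example.com/article"&gt;&lt;img border="0" src="https://example.com/photo.jpg" /&gt;&lt;/a&gt;Washington, Jun 16 () The United States is making very good headway.</description>
      <link>https://example.com/article</link>
    </item>
  </channel>
</rss>"""


class DescriptionParser(HTMLParser):
    """Collects the first <img> src and the visible text of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.image = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        if tag == "img" and self.image is None:
            self.image = dict(attrs).get("src")

    def handle_data(self, data):
        self.text_parts.append(data)


def parse_items(feed_xml):
    news = []
    for item in ET.fromstring(feed_xml).iter("item"):
        parser = DescriptionParser()
        # ElementTree has already unescaped &lt; and &gt;, so the
        # description text is an HTML fragment that can be parsed again.
        parser.feed(item.findtext("description", default=""))
        parser.close()  # flush any buffered text
        news.append({
            "title": item.findtext("title"),
            "image": parser.image,
            "news_description": "".join(parser.text_parts).strip(),
            "news_link": item.findtext("link"),
        })
    return news


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_FEED):
        print(entry)
```

Parsing the description a second time is what separates the image link from the text, whichever parser is used. For the live feed, the string returned by requests would take the place of SAMPLE_FEED, assuming the feed keeps the same layout.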
