Reputation: 123
I am trying to get news items from an RSS feed, but I am not getting exactly the information I want. I am using requests and BeautifulSoup. Each item in the feed looks like this:
<item>
<title>
US making very good headway in respect to Covid-19 vaccines: Donald Trump
</title>
<description>
<a href="https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms"><img border="0" hspace="10" align="left" style="margin-top:3px;margin-right:5px;" src="https://timesofindia.indiatimes.com/photo/76399892.cms" /></a>Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.
</description>
<link>
https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms
</link>
<guid>
https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms
</guid>
<pubDate>
Mon, 15 Jun 2020 22:11:06 PT
</pubDate>
</item>
This is the code I am using:
import requests
from bs4 import BeautifulSoup

def timesofindiaNews():
    URL = 'https://timesofindia.indiatimes.com/rssfeeds_us/72258322.cms'
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, features='xml')
    # print(soup.prettify())
    news_elems = soup.find_all('item')
    news = []
    print(news_elems[0].prettify())
    for news_elem in news_elems:
        title = news_elem.title.text
        news_description = news_elem.description.text
        image = news_elem.description.img
        # news_date = news_elem.pubDate.text
        news_link = news_elem.link.text
I want only the text of the <description> tag, but it also contains extra markup, such as the <a> and <img> tags, which is not required in the description. The code above gives the following output:
{
"image": null,
"news_description": "<a href=\"https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms\"><img border=\"0\" hspace=\"10\" align=\"left\" style=\"margin-top:3px;margin-right:5px;\" src=\"https://timesofindia.indiatimes.com/photo/76399892.cms\" /></a>Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.",
"news_link": "https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms",
"source": "trucknews",
"title": "US making very good headway in respect to Covid-19 vaccines: Donald Trump"
}
Expected output:
{
"image": "image/link/from/the/description",
"news_description": "Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.",
"news_link": "https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms",
"source": "trucknews",
"title": "US making very good headway in respect to Covid-19 vaccines: Donald Trump"
}
Upvotes: 1
Views: 59
Reputation: 1560
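The markup inside the description is treated as plain text, so < and > are changed to &lt; and &gt; when it is written out. That is why I use formatter=None in prettify() and then parse the result again to get at the tags. Please see how news_description is built below. I think this gives the result you want; you can try it: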
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}

def timesofindiaNews():
    URL = 'https://timesofindia.indiatimes.com/rssfeeds_us/72258322.cms'
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.text, 'xml')
    # print(soup.prettify())
    news_elems = soup.find_all('item')
    news = []
    for news_elem in news_elems:
        title = news_elem.title.text
        n_description = news_elem.description
        # formatter=None writes the description out without escaping < and >
        store = n_description.prettify(formatter=None)
        # Parse the unescaped markup again so <a> and <img> become real tags
        sp = BeautifulSoup(store, 'xml')
        # The text node right after the closing </a> is the description itself
        news_description = sp.find("a").next_sibling
        print(news_description)
        # Read the image URL from the re-parsed markup;
        # news_elem.description.img is None because the markup there is only text
        img = sp.find("img")
        image = img["src"] if img else None
        # news_date = news_elem.pubDate.text
        news_link = news_elem.link.text

timesofindiaNews()
The output will be:
Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.
The proposed suspension could extend into the government's new fiscal year beginning October 1, when many new visas are issued, The Wall Street Journal reported on Thursday, quoting unnamed administration officials.
The team of researchers at the University of Georgia (UGA) in the US noted that the SARS-CoV-2 protein PLpro is essential for the replication and the ability of the virus to suppress host immune function.
After two weeks of protests over the death of George Floyd, hundreds of New Yorkers took to the streets again calling for reform in law enforcement and the withdrawal of police department funding.
Indian-origin California Senator Kamala Harris has joined former vice president and 2020 Democratic presidential nominee Joe Biden to raise USD 3.5 million for the upcoming November elections.
and so on....
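As an alternative to re-parsing with BeautifulSoup, the description string can also be split into an image URL and plain text with the standard library's html.parser. This is only a minimal sketch, not the method used above: the names DescriptionParser, split_description and SAMPLE_DESCRIPTION are hypothetical, and the sample string is the description value from the question.

```python
from html.parser import HTMLParser

# Description value copied from the question's output (assumed typical for this feed).
SAMPLE_DESCRIPTION = (
    '<a href="https://timesofindia.indiatimes.com/international/us/'
    'us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/'
    'articleshow/76399892.cms"><img border="0" hspace="10" align="left" '
    'style="margin-top:3px;margin-right:5px;" '
    'src="https://timesofindia.indiatimes.com/photo/76399892.cms" /></a>'
    'Washington, Jun 16 () The United States is making very good headway in '
    'respect to vaccines for the coronavirus pandemic and also therapeutically, '
    'President Donald Trump has said.'
)


class DescriptionParser(HTMLParser):
    """Collects the first <img> src and all text found outside of tags."""

    def __init__(self):
        super().__init__()
        self.image = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == "img" and self.image is None:
            self.image = dict(attrs).get("src")

    def handle_data(self, data):
        # Text between or after tags, e.g. the sentence following </a>.
        self.text_parts.append(data)


def split_description(raw):
    """Return the image URL and the tag-free text of a description string."""
    parser = DescriptionParser()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    return {
        "image": parser.image,
        "news_description": "".join(parser.text_parts).strip(),
    }


if __name__ == "__main__":
    print(split_description(SAMPLE_DESCRIPTION))
```

Inside the loop, something like split_description(news_elem.description.text) could then supply both the image and news_description fields, assuming every description has the same shape as the sample.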
Upvotes: 1