Reputation: 2257
I am trying to scrape the title, summary, date, and link from http://www.indiainfoline.com/top-news for each div with class 'row'.
import urllib2
from bs4 import BeautifulSoup

link = 'http://www.indiainfoline.com/top-news'
redditFile = urllib2.urlopen(link)
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml, "lxml")
productDivs = soup.findAll('div', attrs={'class': 'row'})
for div in productDivs:
    result = {}
    try:
        heading = div.find('p', attrs={'class': 'heading fs20e robo_slab mb10'})
        title = heading.get_text()
        article_link = "http://www.indiainfoline.com" + heading.find('a')['href']
        summary = div.find('p')
    except AttributeError:
        continue
But none of the components are getting fetched. Any suggestion on how to fix this?
Upvotes: 2
Views: 118
Reputation: 5950
There are many class=row elements in the HTML source, so you need to narrow down to the section where the actual row data lives. In your case all 16 expected rows sit under id="search-list", so first extract that section, then the rows. Since .select returns a list, use [0] to get the element itself. Once you have the row data, iterate over it and extract the heading, article URL, summary, etc.
import urllib2
from bs4 import BeautifulSoup

link = 'http://www.indiainfoline.com/top-news'
redditFile = urllib2.urlopen(link)
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml, "lxml")
section = soup.select('#search-list')
rowdata = section[0].select('.row')
for row in rowdata[1:]:
    heading = row.select('.heading.fs20e.robo_slab.mb10')[0].text
    title = 'http://www.indiainfoline.com' + row.select('a')[0]['href']
    summary = row.select('p')[0].text
    print heading
    print title
    print summary
Output:
PFC board to consider bonus issue; stock surges by 4%
http://www.indiainfoline.com/article/news-top-story/pfc-pfc-board-to-consider-bonus-issue-stock-surges-by-4-117080300814_1.html
PFC board to consider bonus issue; stock surges by 4%
...
...
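The question also wants the date. Judging from the row markup the page serves, each row carries a `<p class="source fs12e">` paragraph of the form "Agency | City Time | Date", so (assuming that class and pipe layout are stable) the date can be split out on the pipes. This is a sketch run against an inline, trimmed copy of that markup rather than a live request, and it uses the stdlib `html.parser` so it runs even without lxml installed:

```python
from bs4 import BeautifulSoup

# Trimmed copy of one row from the page's markup (test fixture); on the
# live site you would select these rows from #search-list instead.
row_html = '''
<div class="row">
  <p class="heading fs20e robo_slab mb10">
    <a href="/article/news-top-story/lupin-lupin-gets-usfda-nod-to-market-rosuvastatin-calcium-117080300815_1.html">Lupin gets USFDA nod to market Rosuvastatin Calcium</a>
  </p>
  <p class="source fs12e">India Infoline News Service |
     Mumbai 15:42 IST | August 03, 2017 </p>
</div>
'''

row = BeautifulSoup(row_html, "html.parser").select('.row')[0]
# The source line reads "Agency | City Time | Date"; split on the pipes.
parts = [p.strip() for p in row.select('.source.fs12e')[0].get_text().split('|')]
date = parts[-1]
print(date)  # August 03, 2017
```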
Upvotes: 2
Reputation: 1887
Try this
from bs4 import BeautifulSoup
from urllib.request import urlopen
link = 'http://www.indiainfoline.com/top-news'
soup = BeautifulSoup(urlopen(link),"lxml")
fixed_html = soup.prettify()
ul = soup.find('ul', attrs={'class':'row'})
print(ul.find('li'))
You will get
<li class="animated" onclick="location.href='/article/news-top-story/lupin-lupin-gets-usfda-nod-to-market-rosuvastatin-calcium-117080300815_1.html';">
<div class="row">
<div class="col-lg-9 col-md-9 col-sm-9 col-xs-12 ">
<p class="heading fs20e robo_slab mb10"><a href="/article/news-top-story/lupin-lupin-gets-usfda-nod-to-market-rosuvastatin-calcium-117080300815_1.html">Lupin gets USFDA nod to market Rosuvastatin Calcium</a></p>
<p><!--style="color: green !important"-->
<img class="img-responsive visible-xs mob-img" src="http://content.indiainfoline.com/_media/iifl/img/article/2016-08/19/full/1471586016-9754.jpg"/>
Pharma major, Lupin announced on Thursday that the company has received the United States Food and Drug Administra...
</p>
<p class="source fs12e">India Infoline News Service |
Mumbai 15:42 IST | August 03, 2017 </p>
</div>
<div class="col-lg-3 col-md-3 col-sm-3 hidden-xs pl0 listing-image">
<img class="img-responsive" src="http://content.indiainfoline.com/_media/iifl/img/article/2016-08/19/full/1471586016-9754.jpg"/>
</div>
</div>
</li>
Of course, you can print fixed_html to see the whole page's prettified HTML.
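Building on this, here is a sketch that walks every `<li>` in the ul.row list and collects the four fields the question asked for (title, summary, date, link) into dicts. It runs against an inline, trimmed copy of the `<li>` markup shown above as a test fixture; on the live page you would iterate ul.find_all('li') on the fetched soup instead:

```python
from bs4 import BeautifulSoup

# Trimmed copy of one <li> from the markup shown above (test fixture).
html = '''
<ul class="row">
<li class="animated">
  <div class="row">
    <p class="heading fs20e robo_slab mb10">
      <a href="/article/news-top-story/lupin-lupin-gets-usfda-nod-to-market-rosuvastatin-calcium-117080300815_1.html">Lupin gets USFDA nod to market Rosuvastatin Calcium</a>
    </p>
    <p>Pharma major, Lupin announced on Thursday that the company has received ...</p>
    <p class="source fs12e">India Infoline News Service | Mumbai 15:42 IST | August 03, 2017</p>
  </div>
</li>
</ul>
'''

soup = BeautifulSoup(html, "html.parser")
articles = []
for li in soup.find('ul', attrs={'class': 'row'}).find_all('li'):
    a = li.find('p', attrs={'class': 'heading fs20e robo_slab mb10'}).find('a')
    paragraphs = li.find_all('p')
    articles.append({
        'title': a.get_text(strip=True),
        'link': 'http://www.indiainfoline.com' + a['href'],
        'summary': paragraphs[1].get_text(strip=True),       # second <p> holds the teaser
        'date': paragraphs[2].get_text().split('|')[-1].strip(),  # "... | Date"
    })

print(articles[0]['title'])
print(articles[0]['date'])
```

Collecting each row into a dict keeps the four fields together, which is handy if you later want to dump them to JSON or CSV.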
Upvotes: 1