user2129623

Reputation: 2257

Scraping page content from divs using BeautifulSoup

I am trying to scrape the title, summary, date, and link from http://www.indiainfoline.com/top-news for each div with class 'row'.

import urllib2
from bs4 import BeautifulSoup

link = 'http://www.indiainfoline.com/top-news'
redditFile = urllib2.urlopen(link)
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml, "lxml")
productDivs = soup.findAll('div', attrs={'class': 'row'})
for div in productDivs:
    result = {}
    try:
        heading = div.find('p', attrs={'class': 'heading fs20e robo_slab mb10'}).get_text()
        title = heading.get_text()
        article_link = "http://www.indiainfoline.com" + heading.find('a')['href']
        summary = div.find('p')
    except AttributeError:
        continue  # rows that don't have the expected elements are skipped silently

But none of the components are being fetched. Any suggestions on how to fix this?

Upvotes: 2

Views: 118

Answers (2)

akash karothiya

Reputation: 5950

There are many class="row" elements in the HTML source, so you need to narrow down to the section where the actual row data lives. In your case, all 16 expected rows sit under id="search-list", so first extract that section and then the rows. Since .select() returns a list, use [0] to get the section element. Once you have the row data, iterate over it and extract the heading, article URL, summary, and so on.

import urllib2
from bs4 import BeautifulSoup

link = 'http://www.indiainfoline.com/top-news'
redditFile = urllib2.urlopen(link)
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml, "lxml")

section = soup.select('#search-list')      # the container that holds the article rows
rowdata = section[0].select('.row')        # all rows inside that container

for row in rowdata[1:]:                    # skip the first row
    heading = row.select('.heading.fs20e.robo_slab.mb10')[0].text
    article_url = 'http://www.indiainfoline.com' + row.select('a')[0]['href']
    summary = row.select('p')[0].text
    print(heading)
    print(article_url)
    print(summary)

Output:

PFC board to consider bonus issue; stock surges by 4%     
http://www.indiainfoline.com/article/news-top-story/pfc-pfc-board-to-consider-bonus-issue-stock-surges-by-4-117080300814_1.html
PFC board to consider bonus issue; stock surges by 4%
...
...
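The question also asks for the date, which the loop above does not collect. In the page HTML (shown in the other answer) it sits in a <p class="source fs12e"> element inside each row, so a minimal sketch along the same lines could pull it out too; this assumes the 'source fs12e' class is present on the rows you care about:

# Sketch: also collect the date/source line from each row.
# Assumes each row contains the <p class="source fs12e"> element shown in the page HTML.
articles = []
for row in rowdata[1:]:
    heading = row.select('.heading.fs20e.robo_slab.mb10')[0].text
    article_url = 'http://www.indiainfoline.com' + row.select('a')[0]['href']
    summary = row.select('p')[0].text
    source = row.select('.source.fs12e')   # may be empty if a row has no date line
    date = source[0].get_text(strip=True) if source else None
    articles.append({'title': heading, 'link': article_url,
                     'summary': summary, 'date': date})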

Upvotes: 2

MishaVacic

Reputation: 1887

Try this:

from bs4 import BeautifulSoup
from urllib.request import urlopen 

link = 'http://www.indiainfoline.com/top-news'
soup = BeautifulSoup(urlopen(link),"lxml")
fixed_html = soup.prettify()

ul = soup.find('ul', attrs={'class':'row'})
print(ul.find('li'))

You will get:

<li class="animated" onclick="location.href='/article/news-top-story/lupin-lupin-gets-usfda-nod-to-market-rosuvastatin-calcium-117080300815_1.html';">
<div class="row">
<div class="col-lg-9 col-md-9 col-sm-9 col-xs-12 ">
<p class="heading fs20e robo_slab mb10"><a href="/article/news-top-story/lupin-lupin-gets-usfda-nod-to-market-rosuvastatin-calcium-117080300815_1.html">Lupin gets USFDA nod to market Rosuvastatin Calcium</a></p>
<p><!--style="color: green !important"-->
<img class="img-responsive visible-xs mob-img" src="http://content.indiainfoline.com/_media/iifl/img/article/2016-08/19/full/1471586016-9754.jpg"/>
                                            Pharma major, Lupin announced on Thursday that the company has received the United States Food and Drug Administra...
                                                                        </p>
<p class="source fs12e">India Infoline News Service |                                           
                                            Mumbai                          15:42 IST |                                          August 03, 2017                 </p>
</div>
<div class="col-lg-3 col-md-3 col-sm-3 hidden-xs pl0 listing-image">
<img class="img-responsive" src="http://content.indiainfoline.com/_media/iifl/img/article/2016-08/19/full/1471586016-9754.jpg"/>
</div>
</div>
</li>

Of course, you can print fixed_html to get the whole page content.
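If you want the title, link, summary, and date for every article rather than just the first <li>, a rough sketch in the same Python 3 style might look like the following. The class names are taken from the HTML above; the assumption is that every article <li> contains the same heading, summary, and source paragraphs:

from bs4 import BeautifulSoup
from urllib.request import urlopen

link = 'http://www.indiainfoline.com/top-news'
soup = BeautifulSoup(urlopen(link), "lxml")

ul = soup.find('ul', attrs={'class': 'row'})
for li in ul.find_all('li'):
    heading = li.find('p', attrs={'class': 'heading fs20e robo_slab mb10'})
    if heading is None:                    # skip list items without an article heading
        continue
    title = heading.get_text(strip=True)
    article_link = 'http://www.indiainfoline.com' + heading.find('a')['href']
    summary_p = heading.find_next_sibling('p')   # the <p> right after the heading
    summary = summary_p.get_text(strip=True) if summary_p else None
    source_p = li.find('p', attrs={'class': 'source fs12e'})
    date = source_p.get_text(strip=True) if source_p else None
    print(title, article_link, summary, date, sep='\n')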

Upvotes: 1
