gbzygil

Reputation: 141

How to extract href, alt and img src using Beautiful Soup in Python

Can someone help me extract some data from the sample HTML below using Beautiful Soup in Python? This is what I'm trying to extract:

The href link, for example: /movies/watch-malayalam-movies-online/6106-watch-buddy.html
The alt text, which holds the movie name, for example: Buddy 2013 Malayalam Movie
The thumbnail URL, for example: http://i44.tinypic.com/2lo14b8.jpg

(Each of these occurs multiple times on the page.)

Full source available at: http://olangal.com

Sample html :

 <div class="item column-1">
  <h2>
   <a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
    Buddy
   </a>
  </h2>
  <ul class="actions">
   <li class="email-icon">
    <a href="/component/mailto/?tmpl=component&amp;template=beez_20&amp;link=36bbe22fb7c54b5465609b8a2c60d8c8a1841581" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
     <img src="/media/system/images/emailButton.png" alt="Email" />
    </a>
   </li>
  </ul>
  <img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
  <p class="readmore">
   <a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
    Read more...
   </a>
  </p>
  <div class="item-separator">
  </div>
 </div>
 <div class="item column-2">
  <h2>
   <a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
    Pigman
   </a>
  </h2>
  <ul class="actions">
   <li class="email-icon">
    <a href="/component/mailto/?tmpl=component&amp;template=beez_20&amp;link=2b0dfb09b41b8e6fabfd7ed2a035f4d728bedb1a" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
     <img src="/media/system/images/emailButton.png" alt="Email" />
    </a>
   </li>
  </ul>
  <img width="110" height="105" alt="Pigman 2013 Malayalam Movie" src="http://i41.tinypic.com/jpa3ko.jpg" border="0" />
  <p class="readmore">
   <a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
    Read more...
   </a>
  </p>
  <div class="item-separator">
  </div>
 </div>

Update: I finally cracked it with help from @kroolik. Thank you!

Here's what worked for me:

for eachItem in soup.find_all("div", {"class": "item"}):
    # Drop the <ul class="actions"> block so its email icon and link are skipped.
    eachItem.ul.decompose()

    # After the removal, the only <img> left in the item is the thumbnail.
    for imglink in eachItem.find_all('img'):
        imgfullLink = imglink.get('src').strip()

    # Both the title link and the "Read more..." link match here.
    for link in eachItem.find_all('a'):
        names = link.contents[0].strip()
        fullLink = "http://olangal.com" + link.get('href').strip()
        print("Extracted : " + names + " , " + imgfullLink + " , " + fullLink)
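
A side note on building the full link: the string concatenation above works because every href on this page starts with a slash. A minimal sketch of a more general approach, assuming Python 3 and the standard library's urljoin; BASE_URL and to_absolute are illustrative names, not part of the code that was actually run:

```python
from urllib.parse import urljoin

# Assumed base URL, taken from the site named in the question.
BASE_URL = "http://olangal.com"


def to_absolute(href, base=BASE_URL):
    """Resolve a possibly relative href against the base URL (hypothetical helper)."""
    return urljoin(base, href.strip())


# A root-relative link gets the base prepended.
print(to_absolute("/movies/watch-malayalam-movies-online/6106-watch-buddy.html"))
# An already absolute link (such as the thumbnail) is returned unchanged.
print(to_absolute("http://i44.tinypic.com/2lo14b8.jpg"))
```

This avoids a doubled host name if some href or src on the page turns out to be absolute already.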

Upvotes: 1

Views: 2975

Answers (2)

Maciej Gol

Reputation: 15864

You can get both <img width="110"> and <p class="readmore"> using the following:

for div in soup.find_all(class_='item'):
    # Will match `<p class="readmore">...</p>` that is direct
    # child of the div.
    p = div.find(class_='readmore', recursive=False)

    # Will print `href` attribute of the first `<a>` element
    # inside `p`.
    print(p.a['href'])

    # Will match `<img width="110">` that is direct child
    # of the div.
    img = div.find('img', width=110, recursive=False)

    print(img['src'], img['alt'])

Note that this requires Beautiful Soup 4 (the bs4 package); find_all and the class_ keyword are not available in Beautiful Soup 3.
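
For completeness, here is a self-contained sketch of the same idea that can be run without fetching the page. It assumes Beautiful Soup 4 with the built-in html.parser backend; SAMPLE_HTML is a shortened copy of one item from the question (the mailto link is abbreviated), and extract_items is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Shortened copy of one item from the question's sample HTML.
SAMPLE_HTML = """
<div class="item column-1">
  <h2><a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">Buddy</a></h2>
  <ul class="actions">
    <li class="email-icon">
      <a href="/component/mailto/" title="Email"><img src="/media/system/images/emailButton.png" alt="Email" /></a>
    </li>
  </ul>
  <img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
  <p class="readmore"><a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">Read more...</a></p>
  <div class="item-separator"></div>
</div>
"""


def extract_items(html):
    """Return one dict (href, alt, src) per <div class="item"> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for div in soup.find_all("div", class_="item"):
        # recursive=False limits the search to direct children of the div,
        # so the email icon nested inside the <ul> is never considered.
        p = div.find("p", class_="readmore", recursive=False)
        img = div.find("img", recursive=False)
        if p is None or p.a is None or img is None:
            continue  # skip items that do not have the expected layout
        results.append({
            "href": p.a["href"],
            "alt": img.get("alt", "").strip(),
            "src": img.get("src"),
        })
    return results


for item in extract_items(SAMPLE_HTML):
    print(item["href"], item["alt"], item["src"])
```

Collecting the values into dicts rather than printing them directly keeps the three fields of each movie together, which is convenient if they are written to a file or database later.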

Upvotes: 3

Aman Gautam

Reputation: 3579

I usually use PyQuery for this kind of scraping; it's clean and easy, and you can use jQuery selectors directly with it. For example, to see your name and reputation, I would just have to write something like:

from pyquery import PyQuery as pq

# Fetch the profile page and select the name and reputation elements.
d = pq(url='http://stackoverflow.com/users/1234402/gbzygil')
p = d('#user-displayname')
t = d('#user-panel-reputation div h1 a span')
print(p.html())
print(t.html())

So unless you are tied to Beautiful Soup, I would strongly recommend switching to PyQuery or another library that supports XPath well.

Upvotes: 0
