Reputation: 141
Can someone help me extract some data from the below sample html using beautiful soup python? These are what i'm trying to extract:
The href html link : example
/movies/watch-malayalam-movies-online/6106-watch-buddy.html
The alt text which has the movie name :
Buddy 2013 Malayalam Movie
The thumbnail : example http://i44.tinypic.com/2lo14b8.jpg
(There are multiple occurrences of these..)
Full source available at : http:\\olangal.com
Sample html :
<div class="item column-1">
<h2>
<a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
Buddy
</a>
</h2>
<ul class="actions">
<li class="email-icon">
<a href="/component/mailto/?tmpl=component&template=beez_20&link=36bbe22fb7c54b5465609b8a2c60d8c8a1841581" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
<img src="/media/system/images/emailButton.png" alt="Email" />
</a>
</li>
</ul>
<img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
<p class="readmore">
<a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
Read more...
</a>
</p>
<div class="item-separator">
</div>
</div>
<div class="item column-2">
<h2>
<a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
Pigman
</a>
</h2>
<ul class="actions">
<li class="email-icon">
<a href="/component/mailto/?tmpl=component&template=beez_20&link=2b0dfb09b41b8e6fabfd7ed2a035f4d728bedb1a" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
<img src="/media/system/images/emailButton.png" alt="Email" />
</a>
</li>
</ul>
<img width="110" height="105" alt="Pigman 2013 Malayalam Movie" src="http://i41.tinypic.com/jpa3ko.jpg" border="0" />
<p class="readmore">
<a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
Read more...
</a>
</p>
<div class="item-separator">
</div>
</div>
Update : Finally cracked it with help from @kroolik. Thanks to you.
Here's what worked for me:
for eachItem in soup.findAll("div", { "class":"item" }):
eachItem.ul.decompose()
imglinks = eachItem.find_all('img')
for imglink in imglinks:
imgfullLink = imglink.get('src').strip()
links = eachItem.find_all('a')
for link in links:
names = link.contents[0].strip()
fullLink = "http://olangal.com"+link.get('href').strip()
print "Extracted : " + names + " , " + imgfullLink+" , "+fullLink
Upvotes: 1
Views: 2975
Reputation: 15864
You can get both <img width="110">
and <p class="read more">
using the following:
for div in soup.find_all(class_='item'):
# Will match `<p class="readmore">...</p>` that is direct
# child of the div.
p = div.find(class_='readmore', recursive=False)
# Will print `href` attribute of the first `<a>` element
# inside `p`.
print p.a['href']
# Will match `<img width="110">` that is direct child
# of the div.
img = div.find('img', width=110, recursive=False)
print img['src'], img['alt']
Note that this is for the most recent Beautiful Soup version.
Upvotes: 3
Reputation: 3579
I usually use PyQuery for such scrapping, it's clean and easy. You can use jQuery selectors directly with it. e.g to see your Name and reputation, I will just have to write something like
from pyquery import PyQuery as pq
d = pq(url = 'http://stackoverflow.com/users/1234402/gbzygil')
p=d('#user-displayname')
t=d('#user-panel-reputation div h1 a span')
print p.html()
So unless you can't switch from bsoup, I will strongly recommend switching to PyQuery or some library that supports XPath well.
Upvotes: 0