Ziva

Reputation: 3501

Python: extract the href surrounding image

I am using bs4 and want to extract the href of the link surrounding a specified image. For example, the HTML contains:

<div style="text-align:center;"><a href="page/folder1/image.jpg" target="_blank"><img src="page_files/image.jpg" alt="Picture" border="0" width="150" height="150"></a></div>
</div>

I am given the image src (page_files/image.jpg) and want to extract the corresponding href, which in this example is page/folder1/image.jpg. I tried the find_previous method, but I have a problem extracting the href value:

import bs4

# page holds the HTML source as a string
soup = bs4.BeautifulSoup(page)
for img in soup('img'):
  imgLink = img.find_previous("a")

This returns the whole tag:

<a href="Here_is_link"><img alt="Tumblr" border="0" src="Here_is_source"/></a>

But I can't get the href value, because when I try:

imgLink = img.find_previous("a")['href']

I get an error. The same thing happens when I use find_parent:

imgLink = img.find_parent("a")['href']

How can I fix that? And which is better: find_previous() or find_parent()?

Upvotes: 2

Views: 1827

Answers (1)

Martijn Pieters

Reputation: 1122172

Make sure you only look at images that have an <a> parent tag with an href attribute:

for img in soup.select('a[href] img'):
    link = img.find_parent('a', href=True)
    print(link['href'])

The CSS selector picks only images nested inside an <a> tag that has an href attribute. The find_parent() call then applies the same restriction, so it returns that enclosing <a href="..."> tag.
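
To connect this to the case in the question, where the image src is already known, here is a minimal sketch. The helper name href_for_image and the second sample tag (an <a name="..."> anchor) are made up for illustration, and html.parser is just one possible parser choice:

```python
from bs4 import BeautifulSoup

# Sample markup: the snippet from the question plus a hypothetical
# <a name="..."> anchor that wraps an image but has no href.
html = '''
<div style="text-align:center;"><a href="page/folder1/image.jpg" target="_blank"><img src="page_files/image.jpg" alt="Picture" border="0" width="150" height="150"></a></div>
<a name="anchor"><img src="page_files/other.jpg"></a>
'''

def href_for_image(soup, src):
    """Return the href of the <a href> enclosing the <img> with this src, or None."""
    # Only images nested inside an <a> tag with an href are considered.
    for img in soup.select('a[href] img'):
        if img.get('src') == src:
            return img.find_parent('a', href=True)['href']
    return None

soup = BeautifulSoup(html, 'html.parser')
result = href_for_image(soup, 'page_files/image.jpg')
print(result)
```

For an image that is not inside any <a href="..."> tag, such as page_files/other.jpg above, the helper returns None instead of raising.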

If you search all images, chances are that some of them have a parent or preceding <a> tag without an href attribute; <a> tags can also be used as link targets with <a name="...">, for example. If you get NoneType errors, that simply means there is no such parent tag for the given <img> tag.
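
As a rough illustration of those failure cases, the sketch below loops over every image and guards each lookup. The sample markup and the variable names are invented for the example. It also hints at why find_parent() is usually the safer choice for this task: find_previous() returns whichever <a> tag comes earlier in the document, even when that tag does not enclose the image.

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering three cases: a real link,
# a named anchor without href, and an image with no <a> at all.
html = '''
<a href="page/folder1/image.jpg"><img src="linked.jpg"></a>
<a name="top"><img src="anchored.jpg"></a>
<img src="bare.jpg">
'''
soup = BeautifulSoup(html, 'html.parser')

links = {}
for img in soup.find_all('img'):
    parent = img.find_parent('a')
    # find_parent() returns None when no <a> encloses the image;
    # .get('href') returns None when the <a> has no href attribute.
    links[img['src']] = parent.get('href') if parent is not None else None

# find_previous() walks backwards through the document, so for the
# unwrapped image it finds the unrelated <a name="top"> tag.
bare = soup.find('img', src='bare.jpg')
previous_a = bare.find_previous('a')
enclosing_a = bare.find_parent('a')
```

Here links maps only linked.jpg to an href; the other two images map to None rather than triggering an error.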

Upvotes: 4
