Python: extract the href surrounding image

Question

I am using bs4 and want to extract the href of the link surrounding a specified image. For example, in the HTML I have:

I am given the image src (page_files/image.jpg) and want to extract the corresponding href, which in this example is page/folder1/image.jpg. I tried the find_previous method, but I have a problem extracting the href value:

import bs4

soup = bs4.BeautifulSoup(page, "html.parser")
for img in soup('img'):
    imgLink = img.find_previous("a")

This returns the whole tag:

But I can't get the href value, because when I try:

imgLink = img.find_previous("a")['href']

I get an error. The same thing happens when I try find_parent:

imgLink = img.find_parent("a")['href']

How can I fix that? And which is better: find_previous() or find_parent()?

Martijn Pieters · Accepted Answer

Make sure you are only looking for images that have a parent <a> tag with an href attribute:

for img in soup.select('a[href] img'):
    # Nearest enclosing <a> that actually carries an href attribute
    link = img.find_parent('a', href=True)
    print(link['href'])

The CSS selector picks only images that are nested inside an <a> tag with an href attribute. The find_parent() search then applies the same restriction, returning only an enclosing <a> tag that has the attribute set.
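
As a rough illustration, here is a self-contained sketch of the same loop, assuming Python 3. The HTML snippet and the names html and hrefs are made up for the example and are not taken from the question:

```python
import bs4

# Hypothetical markup: one linked image and one image with no link around it.
html = """
<a href="page/folder1/image.jpg"><img src="page_files/image.jpg"></a>
<p><img src="page_files/unlinked.jpg"></p>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

hrefs = []
# 'a[href] img' matches only <img> tags nested inside an <a> that has an href,
# so the unlinked image is never visited.
for img in soup.select("a[href] img"):
    link = img.find_parent("a", href=True)
    hrefs.append(link["href"])

print(hrefs)
```

Only the linked image contributes an entry, so the lookup link["href"] never runs against a missing tag.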

If you are searching for all images, chances are you are finding some whose parent or preceding <a> tag does not have an href attribute; <a> tags can also be used for link targets with <a name="...">, for example. If you are getting NoneType attribute errors, that simply means there is no such parent <a> tag for the given <img> tag.
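
For the case in the question, where the image src is known up front, one possible approach is a small helper that guards against both failure modes described above. This is only a sketch: href_for_image is a hypothetical helper name, and the markup is invented for the example.

```python
import bs4


def href_for_image(soup, src):
    """Return the href of the <a> enclosing the <img> with this src, or None."""
    img = soup.find("img", src=src)
    if img is None:
        return None  # no image with that src in the document
    link = img.find_parent("a", href=True)
    if link is None:
        return None  # the image is not wrapped in a link
    return link["href"]


# Hypothetical markup: a named anchor, an unlinked image, and a linked image.
html = """
<a name="top"></a>
<img src="page_files/unlinked.jpg">
<a href="page/folder1/image.jpg"><img src="page_files/image.jpg"></a>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

linked = href_for_image(soup, "page_files/image.jpg")
unlinked = href_for_image(soup, "page_files/unlinked.jpg")

# find_previous("a") walks backwards through the document rather than up the
# tree, so for the unlinked image it lands on the unrelated named anchor.
previous = soup.find("img", src="page_files/unlinked.jpg").find_previous("a")

print(linked)
print(unlinked)
print(previous.get("href"))
```

In this sketch find_previous("a") returns the <a name="top"> anchor for the unlinked image, and indexing it with ['href'] would raise a KeyError. That suggests find_parent() is the safer choice when the goal is the link that actually encloses the image.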
