Get absolute paths to images instead of blank.gif

Question

I'm processing a few web pages to build a list of absolute image URLs, using the following code:

from urllib.parse import urljoin  # Python 3; on Python 2: from urlparse import urljoin

for img in images:
    try:
        if img["src"].startswith("http"):
            abs_img_url = img["src"]
        else:
            abs_img_url = urljoin(url, img["src"])
    except KeyError:
        # src attribute does not exist
        continue

The problem is with this webpage, where I get a lot of blank.gif images, although the browser displays a different file whose path is stored in the img["data-original"] attribute. Surprisingly, the Firefox inspector shows the correct image in img["src"], but when you view the page source it is in img["data-original"].

Could you explain this behaviour, and how would you handle it programmatically to detect and download the right image instead of blank.gif?

example image element giving bad result:

Dalvenjia · Accepted Answer

The thing is that JavaScript copies the value of the data-original attribute into the src attribute dynamically on load, and since BeautifulSoup does not execute JavaScript, you end up with placeholder src attributes on your images. With this in mind you have two options: either read the data-original attribute yourself, or change your approach to a tool that executes the JavaScript before you parse the page, such as Selenium, CasperJS or PhantomJS.
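
To see what a parser that does not run JavaScript actually receives, here is a minimal sketch using only the standard library's html.parser. The markup and file names in SAMPLE_HTML are made up for illustration and only mimic the lazy-loading pattern described above; ImgAttrCollector is a hypothetical helper, not part of BeautifulSoup.

```python
from html.parser import HTMLParser

# Hypothetical markup modelled on the lazy-loading pattern described above;
# the paths are invented for this example.
SAMPLE_HTML = '<img src="/static/blank.gif" data-original="/photos/cat.jpg">'


class ImgAttrCollector(HTMLParser):
    """Collect the attributes of every <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "img":
            self.images.append(dict(attrs))


collector = ImgAttrCollector()
collector.feed(SAMPLE_HTML)
first = collector.images[0]
print(first["src"])            # the placeholder, exactly as served
print(first["data-original"])  # the real path, which no script has moved into src
```

The inspector shows the DOM after the page's script has run, which is why it disagrees with "view source" and with what the scraper sees.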

I think that searching for the correct attribute is a good way to go without overcomplicating your scraper.

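my_images = []

for img in images:
    try:
        # A placeholder src means the real path is in data-original.
        if img['src'].endswith('blank.gif'):
            my_images.append(img['data-original'])
        else:
            my_images.append(img['src'])
    except KeyError:
        # Skip tags that lack the attribute being read.
        continue

# Resolve relative paths against the page URL.
my_abs_images = [img if img.startswith('http') else urljoin(url, img) for img in my_images]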
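
If you want the same logic in one place, a possible variation is sketched below. The name pick_image_url and its placeholder parameter are my own invention for this example; the function assumes only that each img offers a dict-style .get(), which both a plain dict and a BeautifulSoup Tag provide. It also falls back to data-original when src is missing entirely, which the loop above skips.

```python
from urllib.parse import urljoin


def pick_image_url(img, base_url, placeholder="blank.gif"):
    """Return the absolute URL of the image, or None if the tag has neither attribute.

    `img` only needs a dict-style .get() (a plain dict or a BeautifulSoup Tag).
    """
    src = img.get("src")
    lazy = img.get("data-original")
    # Prefer the lazy-load attribute when src is missing or is the placeholder.
    if lazy and (not src or src.endswith(placeholder)):
        src = lazy
    if not src:
        return None
    # urljoin leaves absolute URLs unchanged, so no startswith("http") check is needed.
    return urljoin(base_url, src)


# Usage with plain dicts standing in for parsed <img> tags (made-up URLs).
page_url = "http://example.com/gallery/"
tags = [
    {"src": "/static/blank.gif", "data-original": "/photos/cat.jpg"},
    {"src": "http://cdn.example.com/a.png"},
    {"src": "img/b.png"},
    {},
]
urls = [u for u in (pick_image_url(t, page_url) for t in tags) if u]
print(urls)
```

Passing every path through urljoin also resolves protocol-relative paths such as //cdn.example.com/x.png against the page's scheme.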
