xralf

Reputation: 3692

Get absolute paths to images instead of blank.gif

I'm processing a few websites to build a list of absolute paths to images, using the following code:

for img in images:
    try:
        if img["src"].startswith("http"):
            abs_img_url = img["src"]
        else:
            abs_img_url = urljoin(url, img["src"])
    except KeyError:
        # src attribute does not exist
        continue

The problem is with this webpage, where I get a lot of blank.gif images, even though the browser displays a different file, whose URL is stored in the img["data-original"] attribute. Surprisingly, the Firefox inspector shows the correct image in img["src"], but when you view the page source, the correct URL is in img["data-original"].

Could you explain this issue, and how would you handle it programmatically to detect and download the right image instead of blank.gif?

Example image element giving a bad result:

<img alt="browser cache backend" class="lazy aligncenter size-full wp-image-57323" data-original="http://11986-presscdn-0-77.pagely.netdna-cdn.com/wp-content/uploads/2008/06/browser-cache-backend.jpg" height="190" itemprop="image" sizes="(max-width: 540px) 100vw, 540px" src="http://11986-presscdn-0-77.pagely.netdna-cdn.com/wp-content/themes/online-tech-tips-2013/images/blank.gif" srcset="http://11986-presscdn-0-77.pagely.netdna-cdn.com/wp-content/uploads/2008/06/browser-cache-backend.jpg 540w, http://11986-presscdn-0-77.pagely.netdna-cdn.com/wp-content/uploads/2008/06/browser-cache-backend-300x106.jpg 300w, http://11986-presscdn-0-77.pagely.netdna-cdn.com/wp-content/uploads/2008/06/browser-cache-backend-80x28.jpg 80w" width="540"/>

Upvotes: 0

Views: 395

Answers (2)

Dalvenjia

Reputation: 2033

The thing is that JavaScript copies the value of the data-original attribute into the src attribute dynamically once the page loads. Since BeautifulSoup does not execute JavaScript, you end up with the placeholder (blank.gif) in the src attribute of your images. With this in mind, you have two options: either read the data-original attribute, or switch to a tool that executes the JavaScript before you parse the page, such as Selenium, CasperJS, or PhantomJS.

I think that searching for the correct attribute is a good way to go without overcomplicating your scraper.

my_images = []

for img in images:
    try:
        if img['src'].endswith('blank.gif'):
            my_images.append(img['data-original'])
        else:
            my_images.append(img['src'])
    except KeyError:
        continue

my_abs_images = [img if img.startswith('http') else urljoin(url, img) for img in my_images]
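The same idea can be folded into one helper. This is only a minimal sketch, assuming Python 3 (urljoin from urllib.parse). The name pick_image_url, the placeholder parameter, and the example.com URLs are made up for illustration. It relies only on a dict-style .get(), which both a BeautifulSoup Tag and a plain dict provide, so it can be tried without fetching a page.

```python
from urllib.parse import urljoin


def pick_image_url(img, base_url, placeholder="blank.gif"):
    """Return an absolute image URL, or None if no usable source exists.

    `img` is anything with a dict-style .get(), such as a BeautifulSoup
    Tag or a plain dict of attributes (hypothetical helper, not a bs4 API).
    """
    src = img.get("src")
    # Fall back to the lazy-load attribute when src is missing
    # or points at the placeholder image.
    if not src or src.endswith(placeholder):
        src = img.get("data-original")
    if not src:
        return None
    # urljoin returns absolute URLs unchanged and resolves relative ones.
    return urljoin(base_url, src)


# Made-up sample data standing in for the parsed <img> tags.
page_url = "http://example.com/articles/post.html"
tags = [
    {"src": "/images/blank.gif", "data-original": "/uploads/photo.jpg"},
    {"src": "http://cdn.example.com/logo.png"},
    {"alt": "no source at all"},
]
urls = [u for u in (pick_image_url(t, page_url) for t in tags) if u]
print(urls)
```

Because urljoin already leaves absolute URLs alone, the separate startswith('http') check is not needed in this version.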

Upvotes: 1

Numlet

Reputation: 837

Try adding a conditional to detect whether the image is called blank.gif:

for img in images:
    try:
        if img["src"].startswith("http"):
            abs_img_url = img["src"]
            # src is only the placeholder; use the lazy-load attribute instead
            if img["src"][-9:] == "blank.gif":
                abs_img_url = img["data-original"]
        else:
            abs_img_url = urljoin(url, img["src"])
    except KeyError:
        # src (or data-original) attribute does not exist
        continue

Upvotes: 1
