Getting error extracting image 'src' using Beautiful Soup

Question

I'm having difficulty extracting an image src using Python 2.7, beautifulsoup4 (4.2.1).

The HTML section I am interested in is:

  
      Buzzfeed  Jay L. Clendenin / Los Angeles Times
  Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013.
  Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013. (Jay L. Clendenin / Los Angeles Times)

The code that I am running is:

image_section = soup.find(class_ = "trb_embed_media")
print image_section
print "================="
img = image_section.find('img')['src']
print img

The output of line 2 of the code above is:

         Buzzfeed
         Jay L. Clendenin / Los Angeles Times
         Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013.
         Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013. (Jay L. Clendenin / Los Angeles Times)

As you can see, the img tag in the output above is missing the src attribute, even though it is present in the original HTML source. What am I missing here? Please advise.

Martijn Pieters · Accepted Answer

That's because the original HTML source doesn't contain the src attribute; JavaScript adds that attribute after the page has loaded.

The JavaScript code presumably uses the data-baseurl attribute to generate the src URL, adding a size and ratio.

The delayload and imgsize_item values in the data-role attribute on the parent tag are a hint there too. You'll have to calculate your own aspect ratio from the given data-content-naturalheight and data-content-naturalwidth attributes and go from there.
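To confirm this for yourself, you can inspect which attributes the parsed tag actually carries. The following is a minimal sketch, not the real page: the markup is a hypothetical snippet modeled on the attribute names mentioned above, and the height value is an assumption chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the attributes named in this answer;
# the height value is an assumption, not taken from the real page.
html = """
<div class="trb_embed_media">
  <img data-baseurl="http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007"
       data-content-naturalwidth="2048"
       data-content-naturalheight="1152">
</div>
"""

soup = BeautifulSoup(html, "html.parser")
img = soup.find(class_="trb_embed_media").find("img")

# .get() returns None for a missing attribute, whereas img['src']
# raises KeyError.
src = img.get("src")
base_url = img.get("data-baseurl")

print(src)                # None: no src in the static HTML
print(sorted(img.attrs))  # the attribute names that are present
print(base_url)
```

Using .get() instead of indexing lets you detect the missing attribute and fall back to the data attributes rather than crashing.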

If you resize the page you'll see that the site is using a responsive design; different image sizes are loaded based on how much horizontal space is available.

A quick experiment shows that you can fill in any size in the URL, as well as any aspect ratio, and the image is autogenerated based on those.

If you wanted to get the full size image, all you have to do is load the base URL; it returns the un-scaled image.
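As a rough sketch of those two options, a small helper can return either the base URL as-is or a sized variant. The function name build_image_url is hypothetical, and the base/width/ratio layout is only what the experiment above suggests, not a documented API of the site.

```python
def build_image_url(base_url, width=None, ratio=None):
    """Return the base URL (unscaled image) or a sized variant of it.

    Assumes the <base>/<width>/<ratio> layout observed by experiment.
    """
    if width is None or ratio is None:
        # No size requested: the base URL alone serves the unscaled image.
        return base_url
    return "{}/{}/{}".format(base_url, width, ratio)


base = "http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007"
full_size = build_image_url(base)
scaled = build_image_url(base, 2048, "16x9")

print(full_size)
print(scaled)
```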

The JavaScript that generates the size and ratio picks among the 16x9, 1x1 and 9x16 aspect ratios, based on the ratio between the width and height from the data attributes. In Python, that logic looks like this:

# Select the first <img> inside the embed container
img = soup.select('div.trb_embed_media img')[0]
# The natural dimensions are stored as strings in data attributes
width, height = map(int, (img['data-content-naturalwidth'], img['data-content-naturalheight']))
ratio = width / float(height)
# Bucket the numeric ratio into one of the three labels
ratio = '1x1' if 0.9 <= ratio <= 1.1 else '16x9' if ratio > 1.1 else '9x16'
img_url = '{}/{}/{}'.format(img['data-baseurl'], width, ratio)

For your example, that generates http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/2048/16x9, which is a valid image:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.latimes.com/business/la-fi-tn-buzzfeed-deal-20140811-story.html')
>>> soup = BeautifulSoup(r.content)
>>> img = soup.select('div.trb_embed_media img')[0]
>>> width, height = map(int, (img['data-content-naturalwidth'], img['data-content-naturalheight']))
>>> ratio = width / float(height)
>>> ratio = '1x1' if 0.9 <= ratio <= 1.1 else '16x9' if ratio > 1.1 else '9x16'
>>> '{}/{}/{}'.format(img['data-baseurl'], width, ratio)
'http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/2048/16x9'
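If you need this for more than one image, the ratio selection can be pulled out into a function. This is only a sketch of the same thresholds used above; the name pick_ratio and the sample dimensions are made up for illustration and do not come from the site.

```python
def pick_ratio(width, height):
    """Map pixel dimensions to one of the '1x1', '16x9' or '9x16' labels."""
    ratio = width / float(height)
    if 0.9 <= ratio <= 1.1:
        return "1x1"    # roughly square
    return "16x9" if ratio > 1.1 else "9x16"  # landscape vs. portrait


# Hypothetical dimensions, one per branch.
for w, h in [(2048, 1152), (1000, 1000), (600, 900)]:
    print("{}x{} -> {}".format(w, h, pick_ratio(w, h)))
```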
