timebandit

Reputation: 830

Getting error extracting image 'src' using Beautiful Soup

I'm having difficulty extracting an image src using Python 2.7 and beautifulsoup4 (4.2.1).

The HTML section I am interested in is:

<div class="trb_embed_media ">  <figure imgratio="16x9" imgwidth="750" imgheight="450" data-role="imgsize_item" class="trb_embed_imageContainer_figure"><img src="http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/750/16x9" data-height="450" data-width="750" data-ratio="16x9" itemprop="image" data-baseurl="http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007" alt="Buzzfeed" class="trb_embed_imageContainer_img" title="Buzzfeed" data-content-naturalwidth="2048" data-content-naturalheight="1365"></figure><div class="trb_embed_related" data-role="lightbox_metadata">      <span class="trb_embed_related_title">Buzzfeed</span>  <div class="trb_embed_related_credit">Jay L. Clendenin / Los Angeles Times</div>  <div class="trb_embed_related_caption">Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013.</div>  <div class="trb_embed_related_credit_and_caption">Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013. (Jay L. Clendenin / Los Angeles Times)</div></div>    </div>

The code that I am running is:

image_section = soup.find(class_ = "trb_embed_media")
print image_section
print "================="
img = image_section.find('img')['src']
print img

The output of line 2 of the code above is:

<div class="trb_embed_media ">
<figure class="trb_embed_imageContainer_figure" data-role=" delayload  delayload_done imgsize_item">
<img alt="Buzzfeed" class="trb_embed_imageContainer_img" data-baseurl="http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007" data-content-naturalheight="1365" data-content-naturalwidth="2048" itemprop="image" title="Buzzfeed"/>
</figure>
<div class="trb_embed_related" data-role="lightbox_metadata">
<span class="trb_embed_related_title">
         Buzzfeed
</span>
<div class="trb_embed_related_credit">
         Jay L. Clendenin / Los Angeles Times
</div>
<div class="trb_embed_related_caption">
         Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013.
</div>
<div class="trb_embed_related_credit_and_caption">
         Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013. (Jay L. Clendenin / Los Angeles Times)
</div>
</div>
</div>

As you can see, the img tag above is missing the src attribute, even though it is present in the original HTML source. What am I missing here? Please advise.

Upvotes: 0

Views: 1163

Answers (1)

Martijn Pieters

Reputation: 1121864

That's because the original HTML source doesn't contain the src attribute; JavaScript adds that attribute after the page is loaded.

The JavaScript code presumably uses the data-baseurl attribute to generate the src URL, adding a size and ratio.
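As a minimal sketch of how to avoid the KeyError that img['src'] raises here: both a BeautifulSoup Tag and a plain dict support .get(), which returns None for a missing attribute. The helper name image_source is hypothetical, and falling back to data-baseurl is an assumption based on the markup in the question.

```python
def image_source(img):
    """Return an image URL from an <img> tag, or None if none is available.

    `img` can be a BeautifulSoup Tag or a plain dict of attributes; both
    support .get(), which returns None instead of raising KeyError.
    """
    # Prefer a real src; fall back to the data-baseurl the script works from
    # (assumption: this attribute is present when src is not).
    return img.get('src') or img.get('data-baseurl')


# Attributes as they appear in the HTML that BeautifulSoup actually received.
attrs = {
    'alt': 'Buzzfeed',
    'data-baseurl': 'http://www.trbimg.com/img-53e8dc49/turbine/'
                    'lat-buzzfeed-la0011761750-20131007',
}
url = image_source(attrs)
```

With a parsed page you would pass the tag itself, for example image_source(image_section.find('img')).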

The delayload and imgsize_item values in the data-role attribute on the parent <figure> tag are a hint there too. You'll have to calculate your own aspect ratio from the given data-content-naturalheight and data-content-naturalwidth attributes and go from there.

If you resize the page you'll see that the site is using a responsive design; different image sizes are loaded based on how much horizontal space is available.

A quick experiment shows that you can fill in any size in the URL, as well as any aspect ratio, and the image is autogenerated based on those.
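A rough sketch of that experiment, assuming the URL layout is always base URL, then width, then ratio, as in the src shown in the question. The helper name sized_url and the 200-pixel thumbnail size are made up for illustration.

```python
def sized_url(base_url, width, ratio):
    """Build a scaled image URL in the <base>/<width>/<ratio> form."""
    # Strip a trailing slash so the result never contains a double slash.
    return '{}/{}/{}'.format(base_url.rstrip('/'), width, ratio)


BASE = ('http://www.trbimg.com/img-53e8dc49/turbine/'
        'lat-buzzfeed-la0011761750-20131007')

thumbnail = sized_url(BASE, 200, '1x1')   # arbitrary size and ratio
banner = sized_url(BASE, 750, '16x9')     # the size the browser requested
```

The second call reproduces the src value that the browser showed in the question.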

If you wanted to get the full size image, all you have to do is load the base URL; it returns the un-scaled image.

The JavaScript used to generate the size and ratio picks among the 16x9, 1x1 and 9x16 aspect ratios, based on the ratio between width and height from the data attributes. The same logic in Python:

img = soup.select('div.trb_embed_media img')[0]
width, height = map(int, (img['data-content-naturalwidth'], img['data-content-naturalheight']))
ratio = width / float(height)
ratio = '1x1' if 0.9 <= ratio <= 1.1 else '16x9' if ratio > 1.1 else '9x16'
img_url = '{}/{}/{}'.format(img['data-baseurl'], width, ratio)

For your example that generates http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/2048/16x9, a valid image:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.latimes.com/business/la-fi-tn-buzzfeed-deal-20140811-story.html')
>>> soup = BeautifulSoup(r.content)
>>> img = soup.select('div.trb_embed_media img')[0]
>>> width, height = map(int, (img['data-content-naturalwidth'], img['data-content-naturalheight']))
>>> ratio = width / float(height)
>>> ratio = '1x1' if 0.9 <= ratio <= 1.1 else '16x9' if ratio > 1.1 else '9x16'
>>> '{}/{}/{}'.format(img['data-baseurl'], width, ratio)
'http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/2048/16x9'
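If you need this for several images on a page, the same steps could be wrapped in a function. This is only a sketch: the name trb_image_url is hypothetical, and returning the bare base URL when the natural-size attributes are missing is an assumption that relies on the base URL serving the unscaled image.

```python
def trb_image_url(img):
    """Build an image URL from an <img> tag's data attributes.

    `img` can be a BeautifulSoup Tag or a plain dict of attributes.
    """
    base = img['data-baseurl']
    width = int(img.get('data-content-naturalwidth', 0))
    height = int(img.get('data-content-naturalheight', 0))
    if width <= 0 or height <= 0:
        # Assumption: without usable dimensions, use the unscaled image.
        return base

    ratio = width / float(height)
    if 0.9 <= ratio <= 1.1:
        label = '1x1'      # roughly square
    elif ratio > 1.1:
        label = '16x9'     # landscape
    else:
        label = '9x16'     # portrait
    return '{}/{}/{}'.format(base, width, label)


example = {
    'data-baseurl': 'http://www.trbimg.com/img-53e8dc49/turbine/'
                    'lat-buzzfeed-la0011761750-20131007',
    'data-content-naturalwidth': '2048',
    'data-content-naturalheight': '1365',
}
example_url = trb_image_url(example)
```

With a parsed page, something like [trb_image_url(img) for img in soup.select('div.trb_embed_media img')] would collect a URL for every embedded image.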

Upvotes: 1
