Reputation: 830
I'm having difficulty extracting an image src using Python 2.7, beautifulsoup4 (4.2.1).
The HTML section I am interested in is:
<div class="trb_embed_media "> <figure imgratio="16x9" imgwidth="750" imgheight="450" data-role="imgsize_item" class="trb_embed_imageContainer_figure"><img src="http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/750/16x9" data-height="450" data-width="750" data-ratio="16x9" itemprop="image" data-baseurl="http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007" alt="Buzzfeed" class="trb_embed_imageContainer_img" title="Buzzfeed" data-content-naturalwidth="2048" data-content-naturalheight="1365"></figure><div class="trb_embed_related" data-role="lightbox_metadata"> <span class="trb_embed_related_title">Buzzfeed</span> <div class="trb_embed_related_credit">Jay L. Clendenin / Los Angeles Times</div> <div class="trb_embed_related_caption">Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013.</div> <div class="trb_embed_related_credit_and_caption">Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013. (Jay L. Clendenin / Los Angeles Times)</div></div> </div>
The code that I am running is:
image_section = soup.find(class_ = "trb_embed_media")
print image_section
print "================="
img = image_section.find('img')['src']
print img
The output of line 2 of the code above is:
<div class="trb_embed_media ">
<figure class="trb_embed_imageContainer_figure" data-role=" delayload delayload_done imgsize_item">
<img alt="Buzzfeed" class="trb_embed_imageContainer_img" data-baseurl="http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007" data-content-naturalheight="1365" data-content-naturalwidth="2048" itemprop="image" title="Buzzfeed"/>
</figure>
<div class="trb_embed_related" data-role="lightbox_metadata">
<span class="trb_embed_related_title">
Buzzfeed
</span>
<div class="trb_embed_related_credit">
Jay L. Clendenin / Los Angeles Times
</div>
<div class="trb_embed_related_caption">
Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013.
</div>
<div class="trb_embed_related_credit_and_caption">
Buzzfeed's Los Angeles headquarters on Beverly Boulevard on Oct. 7, 2013. (Jay L. Clendenin / Los Angeles Times)
</div>
</div>
</div>
As you can see, the img tag above is missing the src attribute, even though it is present in the original HTML source. What am I missing here? Please advise.
Upvotes: 0
Views: 1163
Reputation: 1121864
That's because the original HTML source doesn't contain the src attribute; JavaScript adds that attribute after the page is loaded.
The JavaScript code presumably uses the data-baseurl attribute to generate the src URL, adding a size and ratio.
The delayload and imgsize_item values in the data-role attribute on the parent <figure> tag are a hint there too. You'll have to calculate your own aspect ratio from the given data-content-naturalheight and data-content-naturalwidth attributes and go from there.
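Because the parsed markup has no src, indexing the tag with ['src'] raises a KeyError. A minimal sketch of a safer lookup, assuming you are happy to fall back to data-baseurl when src is absent (the helper name best_image_ref is hypothetical, and the dict below simply mirrors the attributes shown in the question's output):

```python
def best_image_ref(attrs):
    """Return the src if present, else fall back to data-baseurl.

    `attrs` is any mapping of attribute names to values, for example the
    `.attrs` dict of a BeautifulSoup Tag. The function name is hypothetical.
    """
    # .get() returns None for a missing key instead of raising KeyError
    return attrs.get('src') or attrs.get('data-baseurl')


# Attributes as they appear in the parsed (pre-JavaScript) markup
parsed_attrs = {
    'alt': 'Buzzfeed',
    'data-baseurl': 'http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007',
    'data-content-naturalwidth': '2048',
    'data-content-naturalheight': '1365',
}
print(best_image_ref(parsed_attrs))
```

With a real soup object this would be called as best_image_ref(image_section.find('img').attrs).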
If you resize the page you'll see that the site is using a responsive design; different image sizes are loaded based on how much horizontal space is available.
A quick experiment shows that you can fill in any size in the URL, as well as any aspect ratio, and the image is autogenerated based on those.
If you wanted to get the full size image, all you have to do is load the base URL; it returns the un-scaled image.
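As a rough illustration of that URL scheme, here is a sketch that builds scaled variants by appending a width and a ratio to the base URL. The helper name sized_urls and the widths other than 750 are assumptions for illustration; whether the server accepts every combination is based only on the quick experiment described above.

```python
def sized_urls(base_url, widths, ratio='16x9'):
    """Build '<base>/<width>/<ratio>' URLs for each requested width.

    Hypothetical helper; the URL layout follows the src seen in the browser.
    """
    base = base_url.rstrip('/')  # avoid a double slash if the base ends in '/'
    return ['{}/{}/{}'.format(base, w, ratio) for w in widths]


base_url = 'http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007'

# Scaled variants; 750 matches the size seen in the question, the others are examples
for url in sized_urls(base_url, [320, 750, 1400]):
    print(url)

# The un-scaled image is just the base URL itself
full_size_url = base_url
```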
The JavaScript used to generate size and ratio picks among 16x9, 1x1 and 9x16 aspect ratios, based on the ratio between height and width from the data attributes:
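# Locate the <img> tag inside the embed container
img = soup.select('div.trb_embed_media img')[0]
# Natural dimensions of the unscaled image, as integers
width, height = map(int, (img['data-content-naturalwidth'], img['data-content-naturalheight']))
ratio = width / float(height)
# Roughly square -> 1x1, landscape -> 16x9, portrait -> 9x16
ratio = '1x1' if 0.9 <= ratio <= 1.1 else '16x9' if ratio > 1.1 else '9x16'
img_url = '{}/{}/{}'.format(img['data-baseurl'], width, ratio)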
For your example that generates http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/2048/16x9, a valid image:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.latimes.com/business/la-fi-tn-buzzfeed-deal-20140811-story.html')
>>> soup = BeautifulSoup(r.content)
>>> img = soup.select('div.trb_embed_media img')[0]
>>> width, height = map(int, (img['data-content-naturalwidth'], img['data-content-naturalheight']))
>>> ratio = width / float(height)
>>> ratio = '1x1' if 0.9 <= ratio <= 1.1 else '16x9' if ratio > 1.1 else '9x16'
>>> '{}/{}/{}'.format(img['data-baseurl'], width, ratio)
'http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007/2048/16x9'
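If you need this for more than one image, the same logic can be wrapped in small functions. This is only a sketch: the names pick_ratio and build_image_url are hypothetical, the 0.9 and 1.1 thresholds are the ones used above, and attrs can be any mapping such as the .attrs dict of a BeautifulSoup tag.

```python
def pick_ratio(width, height):
    """Choose among 1x1, 16x9 and 9x16 using the thresholds from above."""
    ratio = width / float(height)
    if 0.9 <= ratio <= 1.1:
        return '1x1'   # roughly square
    return '16x9' if ratio > 1.1 else '9x16'  # landscape vs. portrait


def build_image_url(attrs, width=None):
    """Build '<data-baseurl>/<width>/<ratio>' from an img tag's attributes.

    If `width` is omitted, the natural width of the image is used.
    """
    natural_w = int(attrs['data-content-naturalwidth'])
    natural_h = int(attrs['data-content-naturalheight'])
    if width is None:
        width = natural_w
    return '{}/{}/{}'.format(attrs['data-baseurl'], width,
                             pick_ratio(natural_w, natural_h))


# Attribute values copied from the img tag in the question
attrs = {
    'data-baseurl': 'http://www.trbimg.com/img-53e8dc49/turbine/lat-buzzfeed-la0011761750-20131007',
    'data-content-naturalwidth': '2048',
    'data-content-naturalheight': '1365',
}
print(build_image_url(attrs))             # natural width
print(build_image_url(attrs, width=750))  # the width the browser requested
```

Passing width=750 reproduces the src value that the browser showed in the question.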
Upvotes: 1