jpe2796

Reputation: 31

Extract an image using Python's Beautiful Soup

I used the following code to extract the HTML I need from an Amazon listing:

import requests
from bs4 import BeautifulSoup

r=requests.get("http://www.amazon.com/dp/B0007RXSB4")

soup=BeautifulSoup(r.content)

soup.find_all("div", {"id":"imgTagWrapperId"})

Which gave me this:

[<div class="imgTagWrapper" id="imgTagWrapperId">\n<img alt="Johnston         
&amp; Murphy Men's Greenwich Oxford,Black,6 D" class="a-dynamic-image 
a-stretch-vertical" data-a-dynamic-image='{"http://ecx.images-
amazon.com/images/I/81zwayZox-S._UY695_.jpg":
[695,695],"http://ecx.images-amazon.com/images/I/81zwayZox-
S._UY535_.jpg":[535,535],"http://ecx.images-
amazon.com/images/I/81zwayZox-S._UY500_.jpg":
[500,500],"http://ecx.images-amazon.com/images/I/81zwayZox-
S._UY575_.jpg":[575,575],"http://ecx.images-
amazon.com/images/I/81zwayZox-S._UY395_.jpg":
[395,395],"http://ecx.images-amazon.com/images/I/81zwayZox-
S._UY585_.jpg":[585,585]}' data-old-hires="http://ecx.images-
amazon.com/images/I/81zwayZox-S._UL1500_.jpg" id="landingImage" 
onload="this.onload='';setCSMReq('af');if(typeof addlongPoleTag === 
'function'){ addlongPoleTag('af','desktop-image-atf-
marker');};setCSMReq('cf')" src="http://ecx.images-
amazon.com/images/I/41KixMIlPNL._SY395_QL70_.jpg" style="max-
width:695px;max-height:695px;">\n</img></div>]

I just need to know how to extract

http://ecx.images-amazon.com/images/I/81zwayZox-S._UY695_.jpg

from the output above.

Upvotes: 3

Views: 11291

Answers (1)

alecxe

Reputation: 473893

First, you need to find the img tag inside the div you've already found. One way would be to chain the find() calls:

img = soup.find("div", {"id": "imgTagWrapperId"}).find("img")

Or, with a CSS selector:

img = soup.select_one("div#imgTagWrapperId > img")

Then, if you need the image URL in the src attribute:

img["src"]
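Both lookups return None when the wrapper div or the img is missing (for example, if the site serves a different page than expected), and indexing None raises an error. Here is a minimal sketch of a guarded lookup. SAMPLE_HTML is a trimmed stand-in for the markup in the question rather than the live page, and get_image_src is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the markup shown in the question (not the live page).
SAMPLE_HTML = """
<div class="imgTagWrapper" id="imgTagWrapperId">
<img id="landingImage" src="http://ecx.images-amazon.com/images/I/41KixMIlPNL._SY395_QL70_.jpg">
</div>
"""


def get_image_src(html):
    """Return the src of the img directly inside div#imgTagWrapperId, or None."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.select_one("div#imgTagWrapperId > img")
    if img is None:
        # Wrapper div or img not present in this document.
        return None
    # .get() returns None instead of raising if the attribute is absent.
    return img.get("src")


print(get_image_src(SAMPLE_HTML))
print(get_image_src("<html><body>no image here</body></html>"))
```

Passing "html.parser" explicitly avoids the warning Beautiful Soup emits when no parser is named; another installed parser such as lxml should work as well.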

If you need the image URLs stored in the data-a-dynamic-image attribute, I suggest loading the value into a Python dictionary with the json module and taking its keys():

import json

img = soup.find("div", {"id": "imgTagWrapperId"}).find("img")
data = json.loads(img["data-a-dynamic-image"])
print(list(data.keys()))

Prints:

[
    u'http://ecx.images-amazon.com/images/I/81zwayZox-S._UY695_.jpg',
    u'http://ecx.images-amazon.com/images/I/81zwayZox-S._UY575_.jpg',     
    u'http://ecx.images-amazon.com/images/I/81zwayZox-S._UY500_.jpg',     
    u'http://ecx.images-amazon.com/images/I/81zwayZox-S._UY395_.jpg',     
    u'http://ecx.images-amazon.com/images/I/81zwayZox-S._UY535_.jpg',     
    u'http://ecx.images-amazon.com/images/I/81zwayZox-S._UY585_.jpg'
]
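Since the question asks for one specific URL (the 695px image), it may help to pick the largest entry rather than list every key. The sketch below assumes each value in the JSON is a pair of pixel dimensions, as the output in the question suggests. DYNAMIC_IMAGE is the attribute value from the question shortened to three entries, and largest_image_url is a hypothetical helper name:

```python
import json

# Attribute value copied from the question, shortened to three entries.
DYNAMIC_IMAGE = (
    '{"http://ecx.images-amazon.com/images/I/81zwayZox-S._UY695_.jpg": [695, 695], '
    '"http://ecx.images-amazon.com/images/I/81zwayZox-S._UY535_.jpg": [535, 535], '
    '"http://ecx.images-amazon.com/images/I/81zwayZox-S._UY395_.jpg": [395, 395]}'
)


def largest_image_url(attr_value):
    """Return the URL whose listed dimensions cover the largest area, or None."""
    sizes = json.loads(attr_value)
    if not sizes:
        return None
    # Compare by area so the result does not depend on key order.
    return max(sizes, key=lambda url: sizes[url][0] * sizes[url][1])


print(largest_image_url(DYNAMIC_IMAGE))
```

With a parsed page, the same function could be called as largest_image_url(img["data-a-dynamic-image"]).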

Upvotes: 6
