Reputation: 3692
I'm processing a few web pages to build a list of absolute image URLs, using the following code:
for img in images:
    try:
        if img["src"].startswith("http"):
            abs_img_url = img["src"]
        else:
            abs_img_url = urljoin(url, img["src"])
    except KeyError:
        # src attribute does not exist
        continue
The problem is with this webpage, where I get a lot of blank.gif images, even though the browser displays a different file, whose URL is stored in the img["data-original"] attribute. Surprisingly, the Firefox inspector shows the correct image in img["src"], but when you view the page source it is in img["data-original"].
Could you explain this behaviour, and how would you handle it programmatically to detect and download the right image instead of blank.gif?
Example image element that gives the bad result:
<img alt="browser cache backend" class="lazy aligncenter size-full wp-image-57323" data-original="http://11986-presscdn-0-77.pagely.netdna-cdn.com/wp-content/uploads/2008/06/browser-cache-backend.jpg" height="190" itemprop="image" sizes="(max-width: 540px) 100vw, 540px" src="http://11986-presscdn-0-77.pagely.netdna-cdn.com/wp-content/themes/online-tech-tips-2013/images/blank.gif" srcset="http://11986-presscdn-0-77.pagely.netdna-cdn.com/wp-content/uploads/2008/06/browser-cache-backend.jpg 540w, http://11986-presscdn-0-77.pagely.netdna-cdn.com/wp-content/uploads/2008/06/browser-cache-backend-300x106.jpg 300w, http://11986-presscdn-0-77.pagely.netdna-cdn.com/wp-content/uploads/2008/06/browser-cache-backend-80x28.jpg 80w" width="540"/>
Upvotes: 0
Views: 395
Reputation: 2033
The thing is that JavaScript dynamically copies the URL from the data-original attribute into the src attribute on load, and since BeautifulSoup does not execute JavaScript, you end up with placeholder src attributes on your images. With this in mind you have two options: either read the data-original attribute, or switch to something that runs the JavaScript before you parse the page, such as Selenium, CasperJS, or PhantomJS.
I think that looking up the correct attribute is a good way to go without overcomplicating your scraper.
my_images = []
for img in images:
    try:
        if img['src'].endswith('blank.gif'):
            my_images.append(img['data-original'])
        else:
            my_images.append(img['src'])
    except KeyError:
        continue
my_abs_images = [img if img.startswith('http') else urljoin(url, img) for img in my_images]
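If you would rather not depend on the placeholder being named blank.gif, a variation is to prefer data-original whenever it is present and fall back to src otherwise. The following is only a sketch under a few assumptions: it targets Python 3 (urllib.parse), the helper name resolve_img_url and the CANDIDATE_ATTRS tuple are made up for illustration, and it relies on the fact that both a BeautifulSoup Tag and a plain dict support .get().

```python
from urllib.parse import urljoin

# Attribute names tried in order. Only "data-original" and "src" come from
# the question; this tuple is an assumption and may need adjusting per site.
CANDIDATE_ATTRS = ("data-original", "src")


def resolve_img_url(img, base_url):
    """Return an absolute URL for an <img>, or None if no usable attribute.

    `img` can be a BeautifulSoup Tag or a plain dict; both support .get().
    """
    for attr in CANDIDATE_ATTRS:
        value = img.get(attr)
        if value:
            # urljoin returns absolute URLs unchanged and resolves relative ones.
            return urljoin(base_url, value)
    return None
```

With the names from the question, this could be used as `urls = [u for u in (resolve_img_url(img, url) for img in images) if u]`. Because urljoin leaves absolute URLs alone, the separate startswith("http") check is not needed here.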
Upvotes: 1
Reputation: 837
Try adding a conditional that detects whether the image is called blank.gif:
for img in images:
    try:
        if img["src"].startswith("http"):
            abs_img_url = img["src"]
            if img["src"].endswith("blank.gif"):
                abs_img_url = img["data-original"]
        else:
            abs_img_url = urljoin(url, img["src"])
    except KeyError:
        # src attribute does not exist
        continue
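The example element in the question also carries a srcset attribute listing the real image at several widths, so that could serve as a further fallback when data-original is missing. Below is a rough sketch, not a full srcset parser: the function name largest_from_srcset is hypothetical, it assumes the URLs contain no commas, and it treats candidates without a width descriptor (including density descriptors such as 2x) as width 0.

```python
def largest_from_srcset(srcset):
    """Return the URL with the largest width descriptor, or None if empty."""
    best_url, best_width = None, -1
    # Naive split: assumes the URLs themselves contain no commas.
    for candidate in srcset.split(","):
        parts = candidate.split()
        if not parts:
            continue
        url = parts[0]
        width = 0
        # A width descriptor looks like "540w"; anything else counts as 0.
        if len(parts) > 1 and parts[1].endswith("w"):
            try:
                width = int(parts[1][:-1])
            except ValueError:
                width = 0
        if width > best_width:
            best_url, best_width = url, width
    return best_url
```

Inside the loop above it could be called as `largest_from_srcset(img.get("srcset", ""))`, passing the result through urljoin(url, ...) in case the chosen candidate is relative.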
Upvotes: 1