stwhite

Reputation: 3275

Python3 Download Incorrectly Encoded Image From URL

I am trying to download an image that displays as an animated GIF but appears to be encoded as a JPEG. I say it appears to be a JPEG because its file extension is .jpg and its MIME type is image/jpeg.

When I download the file to my local machine (Mac OS X) and try to open it, I get the error:

The file could not be opened. It may be damaged or use a file format that Preview doesn’t recognize.

I realize some people would simply skip this image, but if it can be fixed, I'm looking for a way to do that rather than ignore it.

The url in question is here:

http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg

Here is my code, and I am open to suggestions:

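import requests

# URL of the image to download (the one given above)
media = 'http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg'
# Local path to write to (placeholder name)
uploadedFile = 'image.jpg'

response = requests.get(media, stream=True)
response.raise_for_status()

# The with block closes the file on exit, so no explicit close() is needed
with open(uploadedFile, 'wb') as img:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            img.write(chunk)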

Upvotes: 1

Views: 453

Answers (2)

stwhite

Reputation: 3275

I ended up answering my own question: the fix was to add a Referer header to the request. Most likely an .htaccess rule on the image's server blocks direct access to the file unless the request appears to come from the site's own pages.

from fake_useragent import UserAgent
import requests

# Set url
mediaURL = 'http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg'

# Create a user agent
ua = UserAgent()

# Create a request session
s = requests.Session()

# Set some headers for the request (the HTTP header is spelled 'Referer')
s.headers.update({'User-Agent': ua.chrome, 'Referer': mediaURL})


# Make the request to get the image from the url
response = s.get(mediaURL, allow_redirects=False)


# The request was about to be redirected
if response.status_code == 302:

    # Get the next location that we would have been redirected to
    location = response.headers['Location']

    # Set the previous page url as referer
    s.headers.update({'referer': location})

    # Try the request again, this time with a referer
    response = s.get(mediaURL, allow_redirects=False, cookies=response.cookies)

    print(response.headers)

Hat tip to @raratiru for suggesting the use of allow_redirects.

Also noted in their answer is that the image's server might be intentionally blocking access to prevent general scrapers from viewing their images. Hard to tell, but regardless, this solution works.
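
For readers who would rather avoid extra dependencies, the same idea can be sketched with the standard library alone. This is a hedged illustration, not the code used above: `build_image_request` is a hypothetical helper, the User-Agent string is a placeholder, and the Referer value is assumed to be the page that embeds the image. Whether the server still responds this way is not something the sketch can guarantee.

```python
import urllib.request


def build_image_request(image_url: str, referer: str,
                        user_agent: str = "Mozilla/5.0") -> urllib.request.Request:
    """Build a GET request that carries Referer and User-Agent headers."""
    # The HTTP header is spelled 'Referer', not 'Referrer'.
    headers = {"Referer": referer, "User-Agent": user_agent}
    return urllib.request.Request(image_url, headers=headers)


request = build_image_request(
    "http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg",
    referer="http://www.supergrove.com/gif-images/gif-images-22-1000-about-gif-on-pinterest/",
)

# To actually fetch and save the bytes (network access required):
# with urllib.request.urlopen(request) as resp, open("image.gif", "wb") as out:
#     out.write(resp.read())
```

Building the request separately from sending it makes it easy to inspect the headers before anything goes over the network.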

Upvotes: 1

raratiru

Reputation: 9616

According to Wheregoes, the link of the image:

  • http://www.supergrove.com/wp-content/uploads/2017/03/gif-images-22-1000-about-gif-on-pinterest.jpg

receives a 302 redirect to the page that contains it:

  • http://www.supergrove.com/gif-images/gif-images-22-1000-about-gif-on-pinterest/

Therefore, your code is trying to download a web page as an image.
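
A quick way to confirm this is to look at the first few bytes of whatever was downloaded, since real image files start with a fixed signature and an HTML page does not. The sketch below is only an illustration: `sniff_image_type` is a hypothetical helper, not part of requests or Pillow, and it covers just GIF, JPEG, and PNG.

```python
from typing import Optional

# Leading "magic" bytes for a few common image formats.
_SIGNATURES = (
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
)


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return 'gif', 'jpeg' or 'png' if data starts with a known signature, else None."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

With requests, calling `sniff_image_type(response.content)` and getting None back would suggest the body is the HTML page rather than the image, while a GIF saved under a .jpg name would report 'gif'.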

I tried:

r = requests.get(the_url, headers=headers, allow_redirects=False)

But it returns an empty body and status_code = 302.

(In hindsight, that was to be expected ...)

This server is configured in a way that it will never fulfill that request.

Bypassing that limitation sounds difficult, to the best of my (perhaps limited) knowledge.

Upvotes: 1
