Anders Ross

Reputation: 63

Scrape image with no extension

I'm trying to scrape images from this site: http://mis.historiska.se/mis/sok/bild.asp?uid=336358&g=1

The site also has an option to download different sizes, such as the large image here: http://catview.historiska.se/catview/media/highres/336358

I have no problem downloading manually, scraping the image, or even scraping the URL, but both the image and the URL are missing the image file extension.

I need to scrape the full URL with filename and extension, NOT the actual image.

Upvotes: 0

Views: 331

Answers (1)

kreddyio

Reputation: 155

The proper way to do this is to make a request to the URL and check the response headers for the filename and extension. A simple curl request to the given URL gives me the following response:

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: image/jpeg
Content-Length: 569050
Date: Wed, 20 Jan 2016 15:33:49 GMT

The best way to guess the file extension is to check the "Content-Type" header. Similarly, to get the filename, we'd use the "Content-Disposition" header. Servers are not required to send it, in which case we'll need to guess the filename from the URL. A simple Python snippet for guessing the extension would be as follows:

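import mimetypes
import requests

url = "http://catview.historiska.se/catview/media/highres/336358"
resp = requests.get(url)
# Drop any parameters (e.g. "; charset=...") before the lookup
content_type = resp.headers["Content-Type"].split(";")[0].strip()
# For image/jpeg this is ".jpg" or ".jpe", depending on the Python version
ext = mimetypes.guess_extension(content_type)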
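
For the filename part, here is a rough sketch of the fallback described above, using only the standard library. The function name guess_filename and the "download" default are my own placeholders, and it assumes the headers are passed in as a plain mapping (for example resp.headers from requests). It prefers "Content-Disposition" when the server sends it, otherwise takes the last segment of the URL path and appends an extension guessed from "Content-Type":

```python
import mimetypes
import posixpath
from email.message import Message
from urllib.parse import urlsplit


def guess_filename(url, headers):
    """Best-effort filename for a URL, given its response headers."""
    # Header names are case-insensitive, so normalise the keys first.
    lowered = {key.lower(): value for key, value in headers.items()}

    # 1. Prefer the filename from Content-Disposition, if the server sent one.
    disposition = lowered.get("content-disposition")
    if disposition:
        msg = Message()
        msg["Content-Disposition"] = disposition
        name = msg.get_filename()
        if name:
            # Keep only the last path component of whatever the server sent.
            return posixpath.basename(name)

    # 2. Otherwise fall back to the last segment of the URL path.
    #    "download" is an arbitrary placeholder for URLs with an empty path.
    name = posixpath.basename(urlsplit(url).path) or "download"
    if posixpath.splitext(name)[1]:
        return name  # the URL already carries an extension

    # 3. No extension in the URL: guess one from Content-Type.
    content_type = lowered.get("content-type", "").split(";")[0].strip()
    ext = mimetypes.guess_extension(content_type) or ""
    return name + ext
```

Since you only want the name and not the image itself, you could probably pass in the headers from requests.head(url, allow_redirects=True) instead of a full GET, assuming the server answers HEAD requests with the same headers. For the highres URL above, which sent no "Content-Disposition" in my curl output, this should give "336358" plus whatever extension mimetypes maps image/jpeg to.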

Upvotes: 1
