Reputation: 63
I'm trying to scrape images from this site: http://mis.historiska.se/mis/sok/bild.asp?uid=336358&g=1
The site also has the option to download different sizes, like the big image here: http://catview.historiska.se/catview/media/highres/336358
I have no problem downloading manually, scraping the image, or even scraping the URL, but the image and URL are missing the image extension.
I need to scrape the full URL with filename and extension, NOT the actual image.
Upvotes: 0
Views: 331
Reputation: 155
The proper way to do this is to request the given URL and check the response headers for the filename and extension. A simple curl request to the given URL gives me the following response:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: image/jpeg
Content-Length: 569050
Date: Wed, 20 Jan 2016 15:33:49 GMT
The best way to guess the file extension is to check the "Content-Type" header. Similarly, the filename would come from the "Content-Disposition" header. The server is not required to send that header (it is absent from the response above), in which case we need to guess the filename from the URL. A simple Python snippet for guessing the extension would be as follows:
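import mimetypes
import requests

url = "http://catview.historiska.se/catview/media/highres/336358"
# stream=True defers the body download, so only the headers are read here
resp = requests.get(url, stream=True)
# Drop any parameters such as "; charset=..." before the lookup
content_type = resp.headers.get("Content-Type", "").split(";")[0].strip()
ext = mimetypes.guess_extension(content_type)  # None if the type is unknown
resp.close()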
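The filename side of this (use "Content-Disposition" when it is present, otherwise fall back to the URL) could be sketched roughly as below. The function name guess_filename and the "download" placeholder are my own choices for illustration, not part of any library. It takes the URL and a headers mapping, so it works with resp.headers from requests (which is case-insensitive; a plain dict must use the exact key spelling shown).

```python
import mimetypes
import posixpath
from email.message import Message
from urllib.parse import unquote, urlsplit


def guess_filename(url, headers):
    """Return a best-guess filename (with extension) for a response."""
    # 1. Prefer the name the server announces in Content-Disposition.
    disposition = headers.get("Content-Disposition")
    if disposition:
        msg = Message()
        msg["Content-Disposition"] = disposition
        name = msg.get_filename()
        if name:
            # Keep only the last path component of the announced name.
            return posixpath.basename(name)

    # 2. Otherwise fall back to the last segment of the URL path.
    #    "download" is an arbitrary placeholder for URLs ending in "/".
    name = posixpath.basename(unquote(urlsplit(url).path)) or "download"

    # 3. If that segment has no extension, derive one from Content-Type.
    if not posixpath.splitext(name)[1]:
        content_type = headers.get("Content-Type", "").split(";")[0].strip()
        name += mimetypes.guess_extension(content_type) or ""
    return name
```

Calling guess_filename(url, resp.headers) for the highres URL above should give "336358" plus whatever extension mimetypes maps image/jpeg to on your Python version. Appending that extension to the URL only gives you a label for the file; the server may not actually serve anything at the extended address.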
Upvotes: 1