Reputation: 154
I am building a downloader app with tkinter and requests, and I recently found a bug in it. I want the program to check whether a given URL is downloadable before it starts downloading the URL's content. I used to do this by fetching the URL's headers and checking whether 'Content-Length' exists. That works for some URLs (like https://www.google.com), but for others (like a link to a YouTube video) it does not, and it makes my program crash. I saw someone on Stack Overflow say that I could check for 'attachment' in the 'Content-Disposition' header, but that didn't work for me: it returned the same thing for a downloadable and a non-downloadable URL. What is the best way to do this? Here is the code from the other Stack Overflow question, which I tried and which did not work:
import requests
url = 'https://www.google.com'
headers=requests.head(url).headers
downloadable = 'attachment' in headers.get('Content-Disposition', '')
My former code:
headers = requests.head(url, headers={'accept-encoding': ''}).headers
try:
    print(type(headers['Content-Length']))
    file_size = int(headers['Content-Length'])
except KeyError:
    # Just a class that I defined to raise an exception if the URL was not downloadable
    raise NotDownloadable()
UPDATE: URL: https://aspb1.cdn.asset.aparat.com/aparat-video/a5e07b7f62ffaad0c104763c23d7393215613675-360p.mp4?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjUzMGU0Mzc3ZjRlZjVlYWU0OTFkMzdiOTZkODgwNGQ2IiwiZXhwIjoxNjExMzMzMDQxLCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.FjMi_dkdLCUkt25dfGqPLcehpaC32dBBUNDC9cLNiu0 This URL is the one I used for testing. If you open the URL it directly leads you to a video which you can download but when checking for the 'Content-Disposition' it returned 'None' just like the majority of the downloadable and non-downloadable URLs I have tried.
Upvotes: 4
Views: 4827
Reputation: 15639
RFC 6266 defines the Content-Disposition header field for HTTP. The earlier RFC 2616 (section 15.5) said of it that Content-Disposition:
is not part of the HTTP standard, but since it is widely implemented, we are documenting its use and risks for implementers.
Since the Content-Disposition header is not always available, you could use a solution that not only looks for that specific header, but also looks at the individual file types within the Content-Type header.
Here is a list of Content-Types.
The code below checks the headers for Content-Disposition, but it also checks the headers for some of the Content-Type that are commonly downloadable.
I also added a check for the Content-Length, because it could be useful in chunking the file being downloaded.
Have you considered creating sub-download folders?
import requests
urls = ['https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2019-financial'
'-year-provisional/Download-data/annual-enterprise-survey-2019-financial-year-provisional-csv.csv',
'http://www.pdf995.com/samples/pdf.pdf', 'https://jeroen.github.io/files/sample.rtf',
'https://www.cnn.com/2021/01/23/opinions/biden-climate-change-gillette-wyoming-coal-sutter/index.html',
'https://www.google.com',
'https://thumbs-prod.si-cdn.com/d4e3zqOM5KUq8m0m-AFVxuqa5ZM=/800x600/filters:no_upscale():focal(554x699:555x700)/https://public-media.si-cdn.com/filer/a4/04/a404c799-7118-459a-8de4-89e4a44b124f/img_1317.jpg',
'https://www.blank.org']
for url in urls:
    headers = requests.head(url).headers
    # requests headers are case-insensitive, so .get() matches however the server spells the name
    content_size = headers.get('Content-Length', 'The content size was not available.')
    # the header name uses a hyphen: Content-Disposition, not Content_Disposition
    if 'Content-Disposition' in headers:
        # do something with the file
        pass
    else:
        # keep only the media type, dropping parameters such as "; charset=utf-8"
        content_type = headers.get('Content-Type', '').split(';')[0].strip().lower()
        compression_formats = ['application/gzip', 'application/vnd.rar', 'application/x-7z-compressed',
                               'application/zip', 'application/x-tar']
        image_formats = ['image/bmp', 'image/gif', 'image/jpeg', 'image/png', 'image/svg+xml', 'image/tiff',
                         'image/webp']
        text_formats = ['application/rtf', 'text/plain']
        if content_type in compression_formats:
            print('Compressed file')
            print(content_size)
        elif content_type in image_formats:
            print('Image file')
            print(content_size)
        elif content_type in text_formats:
            print('Text file')
            print(content_size)
        elif content_type == 'application/pdf':
            print('PDF file')
            print(content_size)
        elif content_type == 'text/csv':
            print('CSV File')
            print(content_size)
Here is another version that uses functions and also downloads the files:
import requests
urls = ['https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2019-financial'
'-year-provisional/Download-data/annual-enterprise-survey-2019-financial-year-provisional-csv.csv',
'http://www.pdf995.com/samples/pdf.pdf', 'https://jeroen.github.io/files/sample.rtf',
'https://www.cnn.com/2021/01/23/opinions/biden-climate-change-gillette-wyoming-coal-sutter/index.html',
'https://www.google.com',
'https://thumbs-prod.si-cdn.com/d4e3zqOM5KUq8m0m-AFVxuqa5ZM=/800x600/filters:no_upscale():focal(554x699:555x700)/https://public-media.si-cdn.com/filer/a4/04/a404c799-7118-459a-8de4-89e4a44b124f/img_1317.jpg',
'https://www.blank.org']
def query_headers(webpage):
    # stream=True defers the body download until response.content is read
    response = requests.get(webpage, stream=True)
    headers = response.headers
    file_name = webpage.rsplit('/', 1)[-1]
    # the header name uses a hyphen: Content-Disposition, not Content_Disposition
    if 'Content-Disposition' in headers:
        # do something with the file
        pass
    else:
        # keep only the media type, dropping parameters such as "; charset=utf-8"
        content_type = headers.get('Content-Type', '').split(';')[0].strip().lower()
        compression_formats = ['application/gzip', 'application/vnd.rar', 'application/x-7z-compressed',
                               'application/zip', 'application/x-tar']
        image_formats = ['image/bmp', 'image/gif', 'image/jpeg', 'image/png', 'image/svg+xml', 'image/tiff',
                         'image/webp']
        text_formats = ['application/rtf', 'text/plain']
        if content_type in compression_formats:
            file_type = 'Compressed file'
        elif content_type in image_formats:
            file_type = 'Image file'
        elif content_type in text_formats:
            file_type = 'Text file'
        elif content_type == 'application/pdf':
            file_type = 'PDF file'
        elif content_type == 'text/csv':
            file_type = 'CSV file'
        elif content_type == 'text/html':
            file_type = 'HTML file'
        else:
            file_type = None
        if file_type is None:
            # unknown type: report it, but do not download anything
            file_type = 'no file type found'
        else:
            download_file(file_name, response)
        content_size = get_content_size(headers)
        return f'File Information: file_type: {file_type}, File size: {content_size}, File name: {file_name}'


def get_content_size(headers):
    # Content-Length can be missing (for example with chunked responses); report 0 in that case
    return int(headers.get('Content-Length', 0))


def download_file(filename, file_stream):
    # reading .content pulls the whole body into memory before it is written to disk
    with open(filename, 'wb') as f:
        f.write(file_stream.content)


for url in urls:
    download_info = query_headers(url)
    print(download_info)
# output
File Information: file_type: CSV file, File size: 253178, File name: annual-enterprise-survey-2019-financial-year-provisional-csv.csv
File Information: file_type: PDF file, File size: 433994, File name: pdf.pdf
File Information: file_type: Text file, File size: 9636, File name: sample.rtf
File Information: file_type: HTML file, File size: 185243, File name: index.html
File Information: file_type: HTML file, File size: 0, File name: www.google.com
File Information: file_type: Image file, File size: 78868, File name: img_1317.jpg
File Information: file_type: HTML file, File size: 170, File name: www.blank.org
Upvotes: 4
Reputation: 5347
I think your former code works with a slight modification. As written, it tries to download the complete file, which is why it hangs every time you run it.
import requests
url = 'https://aspb1.cdn.asset.aparat.com/aparat-video/a5e07b7f62ffaad0c104763c23d7393215613675-360p.mp4?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjUzMGU0Mzc3ZjRlZjVlYWU0OTFkMzdiOTZkODgwNGQ2IiwiZXhwIjoxNjExMzMzMDQxLCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.FjMi_dkdLCUkt25dfGqPLcehpaC32dBBUNDC9cLNiu0'
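class NotDownloadable(Exception):
    """Stand-in for the asker's own exception class."""

# stream=True fetches only the headers; the body is not downloaded yet
r = requests.get(url, stream=True)
try:
    print(r.headers)
    #if "Content-Length" in r.headers:
    file_size = int(r.headers["Content-Length"])
except KeyError:
    # Just a class that I defined to raise an exception if the URL was not downloadable
    raise NotDownloadable()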
Use stream=True
r = requests.get(url,stream=True)
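With stream=True, requests fetches only the response headers and defers downloading the body until you access r.content or iterate over the response, so reading Content-Length does not pull down the whole file. If the server uses chunked transfer encoding, the body is sent as a series of non-overlapping "chunks", and such a response does not carry a Content-Length header at all.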
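As a rough sketch of how the deferred body could then be consumed, the helper below writes an iterable of byte chunks to disk one chunk at a time instead of holding the whole file in memory. The name save_chunks and the 8192-byte chunk size are assumptions made for illustration, not part of the original code; with requests, the chunks would come from r.iter_content() on a response opened with stream=True.

```python
from typing import Iterable


def save_chunks(chunks: Iterable[bytes], destination: str) -> int:
    """Write byte chunks to `destination` and return the number of bytes written.

    Hypothetical helper: with requests, pass r.iter_content(chunk_size=8192)
    from a response created with stream=True.
    """
    written = 0
    with open(destination, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Usage with requests (not executed here; "video.mp4" is a placeholder name):
#   with requests.get(url, stream=True) as r:
#       r.raise_for_status()
#       save_chunks(r.iter_content(chunk_size=8192), "video.mp4")
```

Because the function returns the byte count, you can compare it with Content-Length, when the server sends one, to detect a truncated download.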
Upvotes: 1
Reputation: 979
You could check the content-type response header. This header defines the media type of the requested resource. The most common types are shown here.
The content-type header is defined by type "/" subtype, and some also include a parameter, giving it the format type "/" subtype ";" parameter, with parameter in the form attribute "=" value. The parameter value is not mandatory, but type and subtype are.
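To make that format concrete, here is a minimal sketch that splits a content-type value into its type, subtype, and parameters. The function name parse_content_type is made up for this example, and the parsing is deliberately simple: it assumes parameter values do not themselves contain a ";".

```python
def parse_content_type(value):
    """Split 'type/subtype; attribute=value' into (type, subtype, params)."""
    media_type, *raw_params = value.split(";")
    # type and subtype are case-insensitive, so normalise them to lower case
    main_type, _, subtype = media_type.strip().lower().partition("/")
    params = {}
    for raw in raw_params:
        name, _, val = raw.partition("=")
        name = name.strip().lower()
        if name:
            # parameter values may be wrapped in double quotes
            params[name] = val.strip().strip('"')
    return main_type, subtype, params


print(parse_content_type("text/html; charset=UTF-8"))
```

Comparing only the first two elements lets a check such as "is this an image?" ignore whatever parameters the server appends.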
RFC 1341 defined 7 top-level types (IANA has since registered a few more, such as font and model):
text, multipart, application, message, image, audio, video
The value you are looking for varies with the resource you are expecting, but here are some examples you may use.
Examples
Download an image
import requests
url = "https://example.com/picture.png"  # placeholder; replace with the URL you want to check
response = requests.head(url)
response_headers = response.headers
# default to "" so a missing header does not raise AttributeError on .lower()
response_content_type = response_headers.get("content-type", "")
# keep only type/subtype, dropping any parameter such as "; charset=utf-8"
media_type = response_content_type.split(";")[0].strip().lower()
# you could use this code to search for all images using just the type
is_image = media_type.split("/")[0] == "image"
# alternatively you could specify your expected content-types including the subtype
CONTENT_TYPES = ["image/gif", "image/jpeg", "image/png", "image/tiff", "image/svg+xml"]  # ...
is_image = media_type in CONTENT_TYPES
if is_image:
    # code to download image
    pass
This code could easily be adapted for different types and subtypes.
Note
It is worth noting that the types are fixed: you cannot define a new type, but you can define a new subtype.
Upvotes: 0
Reputation: 9649
Content-Disposition provides filename information if it is not given in the URL. But this information is not always present, as is the case with your URL. One solution is to filter by content type, as in the example below. You can add filters if you wish to download only specific content types such as video/mp4.
import requests
url = 'https://aspb1.cdn.asset.aparat.com/aparat-video/a5e07b7f62ffaad0c104763c23d7393215613675-360p.mp4?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjUzMGU0Mzc3ZjRlZjVlYWU0OTFkMzdiOTZkODgwNGQ2IiwiZXhwIjoxNjExMzMzMDQxLCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.FjMi_dkdLCUkt25dfGqPLcehpaC32dBBUNDC9cLNiu0'
headers=requests.head(url, allow_redirects=True).headers
# default to '' so a missing Content-Type does not raise AttributeError on .lower()
content_type = headers.get('content-type', '')
if 'text' in content_type.lower():
    downloadable = False
elif 'html' in content_type.lower():
    downloadable = False
else:
    downloadable = True
print(downloadable)
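Since Content-Disposition can carry the filename when it is present, one possible way to choose a name for the saved file is to prefer that header and fall back to the last segment of the URL path. This is only a sketch: pick_filename and the default name "download" are invented for this example. It uses the standard library's email.message.Message to read the filename parameter, and urlsplit so that a query string such as ?wmsAuthSign=... does not end up in the name. Note that a plain dict is case-sensitive, whereas the headers object from requests is not.

```python
import posixpath
from email.message import Message
from urllib.parse import unquote, urlsplit


def pick_filename(url, headers, default="download"):
    """Prefer the Content-Disposition filename, else the last URL path segment."""
    disposition = headers.get("Content-Disposition")
    if disposition:
        msg = Message()
        msg["Content-Disposition"] = disposition
        name = msg.get_filename()
        if name:
            # drop any directory part the server may have included
            return posixpath.basename(name.replace("\\", "/")) or default
    # urlsplit separates the path from the query string
    path = unquote(urlsplit(url).path)
    return posixpath.basename(path) or default


print(pick_filename("https://example.com/files/clip-360p.mp4?token=abc", {}))
```

With requests you would pass requests.head(url, allow_redirects=True).headers as the second argument.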
Upvotes: 1