Omid Ketabollahi

Reputation: 154

How to check if a URL is downloadable in requests

I am making a downloader app using tkinter and requests, and I recently found a bug in my program. I want the program to check whether a given URL is downloadable before it starts downloading the content. I used to do this by requesting the headers of the URL and checking whether 'Content-Length' exists. This works for some URLs (like https://www.google.com), but for others (like a link to a YouTube video) it does not, and it makes my program crash. I saw someone on Stack Overflow suggest checking for 'attachment' in the 'Content-Disposition' header, but that didn't work for me: it returned the same result for a downloadable and a non-downloadable URL. What is the best way to do this? This is the code from the other Stack Overflow question that I tried and that did not work:

import requests
url = 'https://www.google.com'
headers=requests.head(url).headers
downloadable = 'attachment' in headers.get('Content-Disposition', '')

My former code:

headers = requests.head(url, headers={'accept-encoding': ''}).headers
try:
    print(type(headers['Content-Length']))
    file_size = int(headers['Content-Length'])
except KeyError:
    # Just a class that I defined to raise an exception if the URL was not downloadable
    raise NotDownloadable()

UPDATE: URL: https://aspb1.cdn.asset.aparat.com/aparat-video/a5e07b7f62ffaad0c104763c23d7393215613675-360p.mp4?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjUzMGU0Mzc3ZjRlZjVlYWU0OTFkMzdiOTZkODgwNGQ2IiwiZXhwIjoxNjExMzMzMDQxLCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.FjMi_dkdLCUkt25dfGqPLcehpaC32dBBUNDC9cLNiu0 This URL is the one I used for testing. If you open the URL it directly leads you to a video which you can download but when checking for the 'Content-Disposition' it returned 'None' just like the majority of the downloadable and non-downloadable URLs I have tried.

Upvotes: 4

Views: 4827

Answers (4)

Life is complex

Reputation: 15639

According to Request for Comments (RFC) 6266, quoting RFC 2616, the Content-Disposition header field:

is not part of the HTTP standard, but since it is widely implemented, we are documenting its use and risks for implementers.

Since the Content-Disposition header is not always available, you could use a solution that not only looks for that specific header but also checks the Content-Type header for specific media types.

Here is a list of Content-Types.

The code below checks the headers for Content-Disposition, but it also checks the headers for some of the Content-Type that are commonly downloadable.

I also added a check for the Content-Length, because it could be useful in chunking the file being downloaded.

Have you considered creating sub-download folders?

  • download_folder/text_files
  • download_folder/pdf_files

or

  • download_folder/01242021/text_files
  • download_folder/01242021/pdf_files
import requests

urls = ['https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2019-financial'
        '-year-provisional/Download-data/annual-enterprise-survey-2019-financial-year-provisional-csv.csv',
        'http://www.pdf995.com/samples/pdf.pdf', 'https://jeroen.github.io/files/sample.rtf',
        'https://www.cnn.com/2021/01/23/opinions/biden-climate-change-gillette-wyoming-coal-sutter/index.html',
        'https://www.google.com',
        'https://thumbs-prod.si-cdn.com/d4e3zqOM5KUq8m0m-AFVxuqa5ZM=/800x600/filters:no_upscale():focal(554x699:555x700)/https://public-media.si-cdn.com/filer/a4/04/a404c799-7118-459a-8de4-89e4a44b124f/img_1317.jpg',
        'https://www.blank.org']

for url in urls:
    headers = requests.head(url).headers
    # requests stores headers in a case-insensitive mapping, so .get() matches any casing
    Content_Size = headers.get('Content-Length', 'The content size was not available.')


    # the header name uses a hyphen; the membership test is case-insensitive
    Content_Disposition_Exists = 'Content-Disposition' in headers
    if Content_Disposition_Exists is True:
        # do something with the file
        pass
    else:
        # keep only the media type, dropping parameters such as "; charset=utf-8"
        Content_Type = {headers.get('Content-Type', '').split(';')[0].strip().lower()}

        compression_formats = ['application/gzip', 'application/vnd.rar', 'application/x-7z-compressed',
                               'application/zip', 'application/x-tar']
        compressed_file = bool([file_format for file_format in compression_formats if file_format in Content_Type])

        image_formats = ['image/bmp', 'image/gif', 'image/jpeg', 'image/png', 'image/svg+xml', 'image/tiff',
                         'image/webp']
        image_file = bool([file_format for file_format in image_formats if file_format in Content_Type])

        text_formats = ['application/rtf', 'text/plain']
        text_file = bool([file_format for file_format in text_formats if file_format in Content_Type])

        if compressed_file is True:
            print('Compressed file')
            print(Content_Size)
        elif image_file is True:
            print('Image file')
            print(Content_Size)
        elif text_file is True:
            print('Text file')
            print(Content_Size)
        elif 'application/pdf' in Content_Type:
            print('PDF file')
            print(Content_Size)
        elif 'text/csv' in Content_Type:
            print('CSV File')
            print(Content_Size)

Here is another version that uses functions:

import requests

urls = ['https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2019-financial'
        '-year-provisional/Download-data/annual-enterprise-survey-2019-financial-year-provisional-csv.csv',
        'http://www.pdf995.com/samples/pdf.pdf', 'https://jeroen.github.io/files/sample.rtf',
        'https://www.cnn.com/2021/01/23/opinions/biden-climate-change-gillette-wyoming-coal-sutter/index.html',
        'https://www.google.com',
        'https://thumbs-prod.si-cdn.com/d4e3zqOM5KUq8m0m-AFVxuqa5ZM=/800x600/filters:no_upscale():focal(554x699:555x700)/https://public-media.si-cdn.com/filer/a4/04/a404c799-7118-459a-8de4-89e4a44b124f/img_1317.jpg',
        'https://www.blank.org']


def query_headers(webpage):
    response = requests.get(webpage, stream=True)
    headers = response.headers
    file_name = webpage.rsplit('/', 1)[-1]

    # the header name uses a hyphen; the membership test is case-insensitive
    Content_Disposition_Exists = 'Content-Disposition' in headers
    if Content_Disposition_Exists is True:
        # do something with the file
        pass
    else:
        # keep only the media type, dropping parameters such as "; charset=utf-8"
        Content_Type = {headers.get('Content-Type', '').split(';')[0].strip().lower()}

        compression_formats = ['application/gzip', 'application/vnd.rar', 'application/x-7z-compressed',
                               'application/zip', 'application/x-tar']
        compressed_file = bool([file_format for file_format in compression_formats if file_format in Content_Type])

        image_formats = ['image/bmp', 'image/gif', 'image/jpeg', 'image/png', 'image/svg+xml', 'image/tiff',
                         'image/webp']
        image_file = bool([file_format for file_format in image_formats if file_format in Content_Type])

        text_formats = ['application/rtf', 'text/plain']
        text_file = bool([file_format for file_format in text_formats if file_format in Content_Type])
        nl = '\n'

        if compressed_file is True:
            download_file(file_name, response)
            content_size = get_content_size(headers)
            return f'File Information: file_type: Compressed file, File size: {content_size}, File name: {file_name}'
        elif image_file is True:
            download_file(file_name, response)
            content_size = get_content_size(headers)
            return f'File Information: file_type: Image file, File size: {content_size}, File name: {file_name}'
        elif text_file is True:
            download_file(file_name, response)
            content_size = get_content_size(headers)
            return f'File Information: file_type: Text file, File size: {content_size}, File name: {file_name}'
        elif 'application/pdf' in Content_Type:
            download_file(file_name, response)
            content_size = get_content_size(headers)
            return f'File Information: file_type: PDF file, File size: {content_size}, File name: {file_name}'
        elif 'text/csv' in Content_Type:
            download_file(file_name, response)
            content_size = get_content_size(headers)
            return f'File Information: file_type: CSV file, File size: {content_size}, File name: {file_name}'
        elif 'text/html' in "".join(str(Content_Type)):
            download_file(file_name, response)
            content_size = get_content_size(headers)
            return f'File Information: file_type: HTML file, File size: {content_size}, File name: {file_name}'
        else:
            content_size = get_content_size(headers)
            return f'File Information: file_type:  no file type found, File size: {content_size}, File name: {file_name}'


def get_content_size(headers):
    # requests stores headers in a case-insensitive mapping, so .get() matches any casing
    content_length = headers.get('Content-Length')
    # fall back to 0 when the server does not report a size
    return int(content_length) if content_length is not None else 0


def download_file(filename, file_stream):
    with open(f'{filename}', 'wb') as f:
        f.write(file_stream.content)


for url in urls:
    download_info = query_headers(url)
    print(download_info)
# output
# File Information: file_type: CSV file, File size: 253178, File name: annual-enterprise-survey-2019-financial-year-provisional-csv.csv
# File Information: file_type: PDF file, File size: 433994, File name: pdf.pdf
# File Information: file_type: Text file, File size: 9636, File name: sample.rtf
# File Information: file_type: HTML file, File size: 185243, File name: index.html
# File Information: file_type: HTML file, File size: 0, File name: www.google.com
# File Information: file_type: Image file, File size: 78868, File name: img_1317.jpg
# File Information: file_type: HTML file, File size: 170, File name: www.blank.org
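
The sub-download folder idea mentioned earlier in this answer could be sketched roughly as follows. This is only an illustration: the `SUBFOLDERS` mapping, the `target_folder` helper, and the `other_files` fallback name are assumptions of mine, not part of the code above. The date format follows the `01242021` example.

```python
from datetime import date
from pathlib import Path

# Hypothetical mapping from media type to sub-folder name; extend as needed.
SUBFOLDERS = {
    "text/plain": "text_files",
    "application/rtf": "text_files",
    "application/pdf": "pdf_files",
}


def target_folder(content_type, base="download_folder", day=None, dated=False):
    """Return the folder a download with this Content-Type should go into."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    media_type = content_type.split(";")[0].strip().lower()
    folder = Path(base)
    if dated:
        day = day or date.today()
        folder = folder / day.strftime("%m%d%Y")  # month, day, year, e.g. 01242021
    # "other_files" is an assumed fallback for unmapped media types.
    return folder / SUBFOLDERS.get(media_type, "other_files")
```

You would still need to create the folder before writing to it, for example with `target_folder(...).mkdir(parents=True, exist_ok=True)`.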

Upvotes: 4

Ajay

Reputation: 5347

I think your former code works with a slight modification. It is trying to download the complete file, which is why it hangs every time you run it.

import requests
url = 'https://aspb1.cdn.asset.aparat.com/aparat-video/a5e07b7f62ffaad0c104763c23d7393215613675-360p.mp4?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjUzMGU0Mzc3ZjRlZjVlYWU0OTFkMzdiOTZkODgwNGQ2IiwiZXhwIjoxNjExMzMzMDQxLCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.FjMi_dkdLCUkt25dfGqPLcehpaC32dBBUNDC9cLNiu0'
# stream=True defers the body download until it is actually read
r = requests.get(url, stream=True)


class NotDownloadable(Exception):
    """Raised when the URL does not look downloadable."""


try:
    print(r.headers)
    file_size = int(r.headers["Content-Length"])
except KeyError:
    # no Content-Length header was sent, so treat the URL as not downloadable
    raise NotDownloadable()

Use stream=True

r = requests.get(url,stream=True)

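The requests documentation covers this under "Body Content Workflow": with stream=True, only the response headers are downloaded and the body is deferred until you access r.content or iterate over the response. If the server uses chunked transfer encoding, the data stream is divided into a series of non-overlapping "chunks" that are sent out independently of one another, and the response may then have no Content-Length header.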
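
As a rough sketch of how a streamed response could be consumed once the headers look acceptable: the helper names `expected_size` and `write_chunks` and the file name `video.mp4` are made up for illustration, and the requests calls are shown only in comments so the helpers can be tried without a network connection.

```python
def expected_size(headers):
    """Return Content-Length as an int, or None if it is absent or malformed."""
    try:
        return int(headers.get("Content-Length"))
    except (TypeError, ValueError):
        return None


def write_chunks(chunks, out_file):
    """Write an iterable of byte chunks to an open binary file; return bytes written."""
    written = 0
    for chunk in chunks:
        if chunk:  # skip empty chunks
            out_file.write(chunk)
            written += len(chunk)
    return written


# Possible usage with requests (not executed here):
#   r = requests.get(url, stream=True)
#   size = expected_size(r.headers)   # None when the server sends no length
#   with open("video.mp4", "wb") as f:
#       write_chunks(r.iter_content(chunk_size=8192), f)
```

Returning None instead of raising lets the caller decide whether a missing Content-Length should count as "not downloadable" or simply as "size unknown".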

Upvotes: 1

GAP2002

Reputation: 979

You could check the content-type response header. This header defines the media type of the requested resource. The most common types are shown here.

The content-type header has the form type "/" subtype. Some values also include a parameter, giving the form type "/" subtype ";" parameter, where the parameter is written as attribute "=" value. The parameter is optional, but type and subtype are required.

RFC 1341 defines 7 top-level types:

text, multipart, application, message, image, audio, video
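
To make the header format above concrete, here is a simplified parsing sketch. `parse_content_type` is a hypothetical helper, and it assumes parameter values do not themselves contain ";" (quoted values with semicolons would need a stricter parser).

```python
def parse_content_type(header_value):
    """Split a Content-Type value into (type, subtype, params)."""
    media_type, *raw_params = header_value.split(";")
    # Type and subtype are case-insensitive, so normalise them to lower case.
    main_type, _, subtype = media_type.strip().lower().partition("/")
    params = {}
    for raw in raw_params:
        name, sep, value = raw.partition("=")
        if sep:  # ignore fragments that are not attribute=value pairs
            params[name.strip().lower()] = value.strip().strip('"')
    return main_type, subtype, params
```

For example, `parse_content_type("text/html; charset=UTF-8")` gives `("text", "html", {"charset": "UTF-8"})`, so a check on the first element is enough to filter by top-level type.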

The value you are looking for varies with the resource you are expecting, but here are some examples you may use.

Examples

Download an image

import requests

url = "https://www.example.com/image.png"  # placeholder; use the URL you want to check

response = requests.head(url)
response_headers = response.headers
# default to "" so a missing header does not raise AttributeError below;
# drop any parameters (e.g. "; charset=utf-8") and normalise the case
response_content_type = response_headers.get("content-type", "").split(";")[0].strip().lower()

# you could use this code to search for all images using just the type

is_image = response_content_type.split("/")[0] == "image"

# alternatively you could specify your expected content-types including the subtype

CONTENT_TYPES = ["image/gif", "image/jpeg", "image/png", "image/tiff", "image/svg+xml"]  # ... extend as needed

is_image = response_content_type in CONTENT_TYPES

if is_image:
    # code to download image
    pass

This code could easily be adapted for different types and subtypes.

Note

It is worth noting that the types are fixed: you cannot define a new type, but you can define a new subtype.

Upvotes: 0

RJ Adriaansen

Reputation: 9649

Content-Disposition provides filename information if it is not given in the url. But this information is not always present, as is the case with your url. A solution is to filter by content type, see the example below. You can add filters if you wish to download specific content types such as video/mp4.

import requests

url = 'https://aspb1.cdn.asset.aparat.com/aparat-video/a5e07b7f62ffaad0c104763c23d7393215613675-360p.mp4?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjUzMGU0Mzc3ZjRlZjVlYWU0OTFkMzdiOTZkODgwNGQ2IiwiZXhwIjoxNjExMzMzMDQxLCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.FjMi_dkdLCUkt25dfGqPLcehpaC32dBBUNDC9cLNiu0'
headers = requests.head(url, allow_redirects=True).headers
# default to '' so a missing Content-Type header does not raise AttributeError
content_type = headers.get('content-type', '').lower()

if 'text' in content_type:
    downloadable = False
elif 'html' in content_type:
    downloadable = False
else:
    downloadable = True

print(downloadable)
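
If you would rather allow only specific content types such as video/mp4, one possible variation is an allow-list. This is a sketch: `ALLOWED_TYPES` and `is_allowed` are illustrative names, and the set contents are just examples.

```python
# Hypothetical allow-list; add whichever media types you want to download.
ALLOWED_TYPES = {"video/mp4", "application/pdf"}


def is_allowed(headers, allowed=ALLOWED_TYPES):
    """Return True if the Content-Type media type is on the allow-list."""
    content_type = headers.get("content-type") or ""
    # Compare only the media type, ignoring parameters such as "; charset=utf-8".
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in allowed
```

With requests you could call it as `is_allowed(requests.head(url, allow_redirects=True).headers)`. A missing Content-Type header is treated as not allowed here, which is the opposite of the block-list above, so pick whichever default suits your downloader.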

Upvotes: 1
