Reputation: 53873

How to get pdf filename with Python requests?

I'm using the Python requests library to get a PDF file from the web. This works fine, but I now also want the original filename. If I go to a PDF file in Firefox and click download it already has a filename defined to save the pdf. How do I get this filename?

For example:

import requests
r = requests.get('http://www.researchgate.net/profile/M_Gotic/publication/260197848_Mater_Sci_Eng_B47_%281997%29_33/links/0c9605301e48beda0f000000.pdf')
print r.headers['content-type']  # prints 'application/pdf'

I checked the r.headers for anything interesting, but there's no filename in there. I was actually hoping for something like r.filename..

Does anybody know how I can get the filename of a downloaded PDF file with the requests library?

Upvotes: 70

Answers (8)

Stan

Reputation: 350

Using Python's standard library:

from email.message import EmailMessage

msg = EmailMessage()
msg["Content-Disposition"] = response.headers.get("Content-Disposition")
filename = msg.get_filename()

Like others said, the file name is in the "Content-Disposition" header.

The cgi standard library module used to be the way to parse it, but it's deprecated since py311.

The currently recommended way of parsing is using the email module, which is also part of the standard library.

References:

Upvotes: 3

funnydman

Reputation: 11346

According to the documentation, neither Content-Disposition nor its filename attribute is required. Also, I checked dozens links on the internet and haven't found responses with the Content-Disposition header. So, in most cases, I wouldn't rely on it much and just retrieve this information from the request URL (note: I'm taking it from req.url because there could be redirection and we want to get real filename). I used werkzeug because it looks more robust and handles quoted and unquoted filenames. Eventually, I came up with this solution (works since Python 3.8):

from urllib.parse import urlparse

import requests
import werkzeug


def get_filename(url: str):
    try:
        with requests.get(url) as req:
            if content_disposition := req.headers.get("Content-Disposition"):
                param, options = werkzeug.http.parse_options_header(content_disposition)
                if param == 'attachment' and (filename := options.get('filename')):
                    return filename

            path = urlparse(req.url).path
            name = path[path.rfind('/') + 1:]
            return name
    except requests.exceptions.RequestException as e:
        raise e

I wrote some tests using pytest and requests_mock:

import pytest
import requests
import requests_mock

from main import get_filename

TEST_URL = 'https://pwrk.us/report.pdf'


@pytest.mark.parametrize(
    'headers,expected_filename',
    [
        (
                {'Content-Disposition': 'attachment; filename="filename.pdf"'},
                "filename.pdf"
        ),
        (
                # The string following filename should always be put into quotes;
                # but, for compatibility reasons, many browsers try to parse unquoted names that contain spaces.
                # https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition#directives
                {'Content-Disposition': 'attachment; filename=filename with spaces.pdf'},
                "filename with spaces.pdf"
        ),
        (
                {'Content-Disposition': 'attachment;'},
                "report.pdf"
        ),
        (
                {'Content-Disposition': 'inline;'},
                "report.pdf"
        ),
        (
                {},
                "report.pdf"
        )
    ]
)
def test_get_filename(headers, expected_filename):
    with requests_mock.Mocker() as m:
        m.get(TEST_URL, text='resp', headers=headers)
        assert get_filename(TEST_URL) == expected_filename


def test_get_filename_exception():
    with requests_mock.Mocker() as m:
        m.get(TEST_URL, exc=requests.exceptions.RequestException)
        with pytest.raises(requests.exceptions.RequestException):
            get_filename(TEST_URL)

Upvotes: 3

root

Reputation: 2428

Use urllib.request instead of requests because then you can do urllib.request.urlopen(...).headers.get_filename(), which is safer than some of the other answers for the following reason:

If the [Content-Disposition] header does not have a filename parameter, this method falls back to looking for the name parameter on the Content-Type header.

After that, even safer would be to additionally fall back to the filename in the URL, as another answer does.

Upvotes: 6

myildirim

Reputation: 2428

You can use werkzeug for options headers https://werkzeug.palletsprojects.com/en/0.15.x/http/#werkzeug.http.parse_options_header

>>> import werkzeug


>>> werkzeug.http.parse_options_header('text/html; charset=utf8')
('text/html', {'charset': 'utf8'})

Upvotes: 5

Akhilesh Joshi

Reputation: 362

easy python3 implementation to get filename from Content-Disposition:

import requests
response = requests.get(<your-url>)
print(response.headers.get("Content-Disposition").split("filename=")[1])

Upvotes: 9

Nilpo

Reputation: 4816

Building on some of the other answers, here's how I do it. If there isn't a Content-Disposition header, I parse it from the download URL:

import re
import requests
from requests.exceptions import RequestException


url = 'http://www.example.com/downloads/sample.pdf'

try:
    with requests.get(url) as r:

        fname = ''
        if "Content-Disposition" in r.headers.keys():
            fname = re.findall("filename=(.+)", r.headers["Content-Disposition"])[0]
        else:
            fname = url.split("/")[-1]

        print(fname)
except RequestException as e:
    print(e)

There are arguably better ways of parsing the URL string, but for simplicity I didn't want to involve any more libraries.

Upvotes: 24

Eugene V

Reputation: 3116

It is specified in an http header content-disposition. So to extract the name you would do:

import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)[0]

Name extracted from the string via regular expression (re module).

Upvotes: 99

Maksim Solovjov

Reputation: 3157

Apparently, for this particular resource it is in:

r.headers['content-disposition']

Don't know if it is always the case, though.

Upvotes: 11

How to get pdf filename with Python requests?

Answers (8)

Related Questions