Reputation: 53873
I'm using the Python requests library to get a PDF file from the web. This works fine, but I now also want the original filename. If I go to a PDF file in Firefox and click download, it already has a filename defined to save the PDF. How do I get this filename?
For example:
import requests
r = requests.get('http://www.researchgate.net/profile/M_Gotic/publication/260197848_Mater_Sci_Eng_B47_%281997%29_33/links/0c9605301e48beda0f000000.pdf')
print(r.headers['content-type'])  # prints 'application/pdf'
I checked r.headers for anything interesting, but there's no filename in there. I was actually hoping for something like r.filename...
Does anybody know how I can get the filename of a downloaded PDF file with the requests library?
Upvotes: 70
Views: 76004
Reputation: 350
Using Python's standard library:
from email.message import EmailMessage
msg = EmailMessage()
msg["Content-Disposition"] = response.headers.get("Content-Disposition")
filename = msg.get_filename()
As others have said, the file name is in the "Content-Disposition" header.
The cgi standard library module used to be the way to parse it, but it has been deprecated since Python 3.11.
The currently recommended way of parsing it is the email module, which is also part of the standard library.
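A minimal sketch of the same idea wrapped in a helper that tolerates a missing header. The function name `filename_from_content_disposition` is made up for illustration, and `header_value` stands in for `response.headers.get("Content-Disposition")`:

```python
from email.message import EmailMessage
from typing import Optional


def filename_from_content_disposition(header_value: Optional[str]) -> Optional[str]:
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not header_value:
        # The header is optional, so the server may not send it at all.
        return None
    msg = EmailMessage()
    msg["Content-Disposition"] = header_value
    # get_filename() returns None when there is no filename parameter.
    return msg.get_filename()


print(filename_from_content_disposition('attachment; filename="report.pdf"'))
print(filename_from_content_disposition(None))
```

Returning None rather than raising leaves the caller free to fall back to something else, such as the URL.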
Upvotes: 3
Reputation: 11346
According to the documentation, neither the Content-Disposition header nor its filename attribute is required. Also, I checked dozens of links on the internet and didn't find any responses with the Content-Disposition header. So, in most cases, I wouldn't rely on it much and would just retrieve this information from the request URL (note: I'm taking it from req.url because there could be a redirect and we want the real filename). I used werkzeug because it looks more robust and handles both quoted and unquoted filenames. Eventually, I came up with this solution (requires Python 3.8 or later):
from urllib.parse import urlparse

import requests
import werkzeug


def get_filename(url: str):
    try:
        with requests.get(url) as req:
            if content_disposition := req.headers.get("Content-Disposition"):
                param, options = werkzeug.http.parse_options_header(content_disposition)
                if param == 'attachment' and (filename := options.get('filename')):
                    return filename

            path = urlparse(req.url).path
            name = path[path.rfind('/') + 1:]
            return name
    except requests.exceptions.RequestException as e:
        raise e
I wrote some tests using pytest and requests_mock:
import pytest
import requests
import requests_mock

from main import get_filename

TEST_URL = 'https://pwrk.us/report.pdf'


@pytest.mark.parametrize(
    'headers,expected_filename',
    [
        (
            {'Content-Disposition': 'attachment; filename="filename.pdf"'},
            "filename.pdf"
        ),
        (
            # The string following filename should always be put into quotes;
            # but, for compatibility reasons, many browsers try to parse unquoted names that contain spaces.
            # https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition#directives
            {'Content-Disposition': 'attachment; filename=filename with spaces.pdf'},
            "filename with spaces.pdf"
        ),
        (
            {'Content-Disposition': 'attachment;'},
            "report.pdf"
        ),
        (
            {'Content-Disposition': 'inline;'},
            "report.pdf"
        ),
        (
            {},
            "report.pdf"
        )
    ]
)
def test_get_filename(headers, expected_filename):
    with requests_mock.Mocker() as m:
        m.get(TEST_URL, text='resp', headers=headers)
        assert get_filename(TEST_URL) == expected_filename


def test_get_filename_exception():
    with requests_mock.Mocker() as m:
        m.get(TEST_URL, exc=requests.exceptions.RequestException)
        with pytest.raises(requests.exceptions.RequestException):
            get_filename(TEST_URL)
Upvotes: 3
Reputation: 2428
Use urllib.request instead of requests, because then you can call urllib.request.urlopen(...).headers.get_filename(), which is safer than some of the other answers for the following reason:
If the [Content-Disposition] header does not have a filename parameter, this method falls back to looking for the name parameter on the Content-Type header.
After that, even safer would be to additionally fall back to the filename in the URL, as another answer does.
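The header fallback described in the quote can be observed without a network call, because the headers object returned by urlopen() for HTTP URLs is a subclass of email.message.Message. A small sketch, with header values invented for illustration:

```python
from email.message import Message

# Case 1: Content-Disposition carries a filename parameter, so it wins.
with_disposition = Message()
with_disposition["Content-Disposition"] = 'attachment; filename="report.pdf"'
with_disposition["Content-Type"] = 'application/pdf; name="ignored.pdf"'

# Case 2: no Content-Disposition, so get_filename() falls back to the
# name parameter on Content-Type.
type_only = Message()
type_only["Content-Type"] = 'application/pdf; name="from-type.pdf"'

# Case 3: neither parameter is present, so get_filename() returns None.
bare = Message()
bare["Content-Type"] = "application/pdf"

print(with_disposition.get_filename())
print(type_only.get_filename())
print(bare.get_filename())
```

The third case is where a URL-based fallback would still be needed.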
Upvotes: 6
Reputation: 2428
You can use werkzeug to parse options headers: https://werkzeug.palletsprojects.com/en/0.15.x/http/#werkzeug.http.parse_options_header
>>> import werkzeug
>>> werkzeug.http.parse_options_header('text/html; charset=utf8')
('text/html', {'charset': 'utf8'})
Upvotes: 5
Reputation: 362
An easy Python 3 implementation to get the filename from Content-Disposition:
import requests
response = requests.get(<your-url>)
print(response.headers.get("Content-Disposition").split("filename=")[1])
Upvotes: 9
Reputation: 4816
Building on some of the other answers, here's how I do it. If there isn't a Content-Disposition header, I parse the name from the download URL:
import re
import requests
from requests.exceptions import RequestException

url = 'http://www.example.com/downloads/sample.pdf'

try:
    with requests.get(url) as r:
        fname = ''
        if "Content-Disposition" in r.headers.keys():
            fname = re.findall("filename=(.+)", r.headers["Content-Disposition"])[0]
        else:
            fname = url.split("/")[-1]

        print(fname)
except RequestException as e:
    print(e)
There are arguably better ways of parsing the URL string, but for simplicity I didn't want to involve any more libraries.
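For the URL fallback, the standard library's urllib.parse already copes with query strings and percent-encoding, so no extra dependency is needed. A rough sketch follows. The helper name `filename_from_url` is invented here, and passing `r.url` rather than the original `url` would also account for redirects:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, percent-decoded."""
    # .path excludes the query string and the fragment.
    path = urlsplit(url).path
    return unquote(posixpath.basename(path))


print(filename_from_url("http://www.example.com/downloads/sample.pdf?token=abc"))
print(filename_from_url("http://www.example.com/downloads/my%20file.pdf"))
```

An empty string comes back when the path ends in a slash, so callers may want a default name for that case.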
Upvotes: 24
Reputation: 3116
It is specified in the HTTP header content-disposition. So to extract the name you would do:
import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)[0]
The name is extracted from the header string with a regular expression (the re module).
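Note that `filename=(.+)` captures everything after `filename=`, including surrounding quotes and any parameters that follow. A slightly stricter sketch, assuming a simple header without the RFC 5987 `filename*=` form (the helper name `extract_filename` is hypothetical):

```python
import re
from typing import Optional

# Match filename="quoted value" or filename=unquoted-value (up to the next ';').
_FILENAME_RE = re.compile(r'filename\s*=\s*(?:"([^"]*)"|([^;]+))', re.IGNORECASE)


def extract_filename(content_disposition: str) -> Optional[str]:
    """Return the filename parameter without quotes, or None if absent."""
    match = _FILENAME_RE.search(content_disposition)
    if match is None:
        return None
    quoted, unquoted = match.groups()
    return quoted if quoted is not None else unquoted.strip()


print(extract_filename('attachment; filename="a b.pdf"; size=10'))
print(extract_filename("attachment; filename=plain.pdf; size=10"))
print(extract_filename("inline"))
```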
Upvotes: 99
Reputation: 3157
Apparently, for this particular resource it is in:
r.headers['content-disposition']
Don't know if it is always the case, though.
Upvotes: 11