Reputation: 1733
I have a Python web crawler which is downloading files with different extensions. To get the extension from the HTTP header content type, I am using the Python library mimetypes.
http_header = session.head(url, headers={'Accept-Encoding': 'identity'})
extension = mimetypes.guess_extension(http_header.headers['content-type'])
Everything is working fine, except when the HTTP header content type contains ;charset=UTF-8. E.g. mimetypes.guess_extension is returning None for the following examples:
content-type: text/plain;charset=UTF-8 # extension should be .txt OR
content-type: text/x-c;charset=UTF-8 # extension should be .java
Check with mimetypes:
>>> import mimetypes
>>> print(mimetypes.guess_extension('text/plain;charset=UTF-8'))
None
>>>
Question: How do I handle this and get the correct extension from content types ending with ;charset=UTF-8?
I guess it is not a good solution to special-case such values with an if statement, since I never know whether my whitelist is complete or whether I am missing some content type.
Upvotes: 3
Views: 3669
Reputation: 1856
One simple way to deal with that is to split the MIME string and get only the first element.
The following code will return the expected result for both conditions.
http_header = session.head(url, headers={'Accept-Encoding': 'identity'})
extension = mimetypes.guess_extension(http_header.headers['content-type'].split(";")[0])
Remember it is a guess. You can't expect much from it for broad definitions such as plain text. mimetypes.guess_extension() just takes the first element of the list returned by mimetypes.guess_all_extensions() for that type. This is also the reason guessing the extension for text/plain can return .h when .txt is the obvious choice.
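If you want this in one place, here is a minimal sketch of a helper built on the same split idea. The name extension_for and the default parameter are my own and purely illustrative; the helper also tolerates a missing header and whitespace around the semicolon. Which extension comes back for a broad type like text/plain depends on your Python version and platform, so no expected output is shown.

```python
import mimetypes


def extension_for(content_type, default=None):
    """Guess a file extension from a raw Content-Type header value.

    `extension_for` and `default` are hypothetical names for this sketch.
    """
    if not content_type:
        # Header missing or empty: nothing to guess from.
        return default
    # Keep only the media type and drop parameters such as ";charset=UTF-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    # guess_extension() returns None for unknown types, so fall back to default.
    return mimetypes.guess_extension(media_type) or default


if __name__ == "__main__":
    print(extension_for("text/plain;charset=UTF-8"))
    # All candidates known for the type; guess_extension() picks the first one.
    print(mimetypes.guess_all_extensions("text/plain"))
```

In the crawler you would call it as extension_for(http_header.headers.get('content-type'), default='.bin'), so a missing or unknown content type still yields a usable extension instead of None.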
Upvotes: 1