ncw

Reputation: 1733

File extension from MIME type with ;charset=UTF-8

I have a Python web crawler which is downloading files with different extensions. To get the extension from the HTTP header content type, I am using the Python library mimetypes.

http_header = session.head(url, headers={'Accept-Encoding': 'identity'})
extension = mimetypes.guess_extension(http_header.headers['content-type'])

Everything works fine, except when the HTTP Content-Type header contains ;charset=UTF-8. For example, mimetypes.guess_extension returns None for the following:

content-type: text/plain;charset=UTF-8 # extension should be .txt OR  
content-type: text/x-c;charset=UTF-8   # extension should be .java

Check with mimetypes:

>>> import mimetypes
>>> print(mimetypes.guess_extension('text/plain;charset=UTF-8'))
None
>>> 

Question: How do I handle this and get the correct extension from content-types ending with ;charset=UTF-8?

I don't think special-casing these values with an if statement is a good solution, since I can never be sure the whitelist is complete or whether I am missing some content type.

Upvotes: 3

Views: 3669

Answers (1)

BernardoGO

Reputation: 1856

One simple way to deal with that is to split the MIME string and get only the first element.

The following code strips the ;charset=UTF-8 parameter before the lookup, so it works for both of your examples.

http_header = session.head(url, headers={'Accept-Encoding': 'identity'})
extension = mimetypes.guess_extension(http_header.headers['content-type'].split(";")[0])
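
If the header can be missing, or can contain spaces or upper-case letters around the media type, a slightly more defensive variant of the same split may help. This is only a sketch; the function name extension_from_content_type is made up for illustration and is not part of mimetypes.

```python
import mimetypes


def extension_from_content_type(content_type):
    """Guess a file extension from a raw Content-Type header value.

    Hypothetical helper: returns None when the header is missing or
    when mimetypes does not know the media type.
    """
    if not content_type:
        return None
    # Keep only the media type and drop parameters such as ";charset=UTF-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(media_type)
```

With requests, it could be called as extension_from_content_type(http_header.headers.get('content-type')), so a response without the header yields None instead of raising a KeyError.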

Remember that it is only a guess. You can't expect much from it for a type as broad as plain text. It seems that mimetypes.guess_extension() simply takes the first element of the list of extensions known for that type. That is also why guessing the extension for text/plain returns .h when .txt is the obvious choice.
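
If the first candidate is not the one you want, one option is to look at all candidates with mimetypes.guess_all_extensions() and prefer your own choice when it is among them. The sketch below assumes a small preference table; PREFERRED and pick_extension are hypothetical names, and the table contents are only an example.

```python
import mimetypes

# Hypothetical preference table: adjust it to the types your crawler cares about.
PREFERRED = {"text/plain": ".txt"}


def pick_extension(media_type):
    """Return the preferred extension if mimetypes lists it, else the first candidate."""
    candidates = mimetypes.guess_all_extensions(media_type)
    preferred = PREFERRED.get(media_type)
    if preferred in candidates:
        return preferred
    # Fall back to the first candidate, or None for unknown types.
    return candidates[0] if candidates else None
```

This keeps mimetypes as the source of truth and only overrides the order, so an unknown type still gives None.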

Upvotes: 1
