A. K. Tolentino

Reputation: 2222

How can I parse the value of Content-Type from an HTTP header response?

My application makes numerous HTTP requests. Without writing a regular expression, how do I parse Content-Type header values? For example:

text/html; charset=UTF-8

For context, here is the code I use to fetch resources from the Internet:

from requests import head

foo = head("http://www.example.com")

I am expecting something similar to what the methods in mimetools provide. For example:

x = magic("text/html; charset=UTF-8")

Will output:

x.getparam('charset')  # UTF-8
x.getmaintype()  # text
x.getsubtype()  # html

Upvotes: 20

Views: 13515

Answers (5)

awatts

Reputation: 1067

Building on the answer from @philip-couling, we can create just a ContentTypeHeader object and get the MIME (media) type as well as the parameters:

from email.policy import EmailPolicy

def parse_content_type(content_type):
    # header_factory returns a ContentTypeHeader for the 'content-type' name
    header = EmailPolicy.header_factory('content-type', content_type)
    return (header.content_type, dict(header.params))
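If the main type and subtype are needed separately, the same header object appears to expose them directly. The following is a minimal sketch, assuming the standard library's ContentTypeHeader attributes maintype, subtype and params; the variable names are only for illustration.

```python
from email.policy import EmailPolicy

# Build a ContentTypeHeader the same way the function above does.
header = EmailPolicy.header_factory("content-type", "text/html; charset=UTF-8")

maintype = header.maintype               # main type, lowercased by the parser
subtype = header.subtype                 # subtype, lowercased by the parser
charset = header.params.get("charset")   # parameter value, or None if absent

print(maintype, subtype, charset)
```

Parameter names are lowercased by the parser, while parameter values keep their original case.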

Upvotes: 2

Philip Couling

Reputation: 14893

Python has this built in: it's in the email module.

MIME types come from an email standard, "Multipurpose Internet Mail Extensions" (see RFC 2045), which has since been adopted in other contexts such as HTTP.

The simplest way to do this reliably is to use an email parser:

from email.message import Message

_CONTENT_TYPE = "content-type"

def parse_content_type(content_type: str) -> tuple[str, dict[str,str]]:
    email = Message()
    email[_CONTENT_TYPE] = content_type
    params = email.get_params()
    # The first param is the MIME type; the later ones are attributes such as "charset"
    return params[0][0], dict(params[1:])
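For an interface closer to the mimetools methods mentioned in the question, the Message object can also be queried directly. This is a rough sketch, not part of the original answer; the helper name content_type_message is made up for illustration, while get_param, get_content_maintype and get_content_subtype are standard email.message.Message methods.

```python
from email.message import Message

def content_type_message(content_type: str) -> Message:
    """Wrap a raw Content-Type value in a Message so its accessors can be used.

    Hypothetical helper name, chosen for this example only.
    """
    msg = Message()
    msg["content-type"] = content_type
    return msg

x = content_type_message("text/html; charset=UTF-8")

charset = x.get_param("charset")       # like mimetools getparam('charset')
maintype = x.get_content_maintype()    # like mimetools getmaintype()
subtype = x.get_content_subtype()      # like mimetools getsubtype()

print(charset, maintype, subtype)
```

get_param returns None by default when the parameter is missing, so a caller can fall back to a default charset if needed.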

Upvotes: 9

Owen S.

Reputation: 7855

requests doesn't give you an interface to parse the content type, unfortunately, and the standard library on this stuff is a bit of a mess. So I see two options:

Option 1: Go use the python-mimeparse third-party library.

Option 2: To separate the mime type from options like charset, you can use the same technique that requests uses to parse type/encoding internally: use cgi.parse_header.

import cgi  # deprecated since Python 3.11 and removed in 3.13
import requests
response = requests.head('http://example.com')
mimetype, options = cgi.parse_header(response.headers['Content-Type'])

The rest should be simple enough to handle with a split:

maintype, subtype = mimetype.split('/')
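One caveat: a plain split('/') raises ValueError when the value has no slash or more than one. A slightly more forgiving variant, sketched here with a hypothetical helper named split_mimetype, uses str.partition instead:

```python
def split_mimetype(mimetype: str) -> tuple[str, str]:
    """Split 'type/subtype' into its two parts without raising on odd input.

    Hypothetical helper for illustration; the subtype is an empty string
    when the value contains no slash.
    """
    maintype, _, subtype = mimetype.strip().lower().partition("/")
    return maintype, subtype

print(split_mimetype("text/html"))
print(split_mimetype("text"))
```

Whether a malformed value should be tolerated like this or rejected outright depends on the application.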

Update: As of Mar 2023, the current official way of doing this, now that cgi is deprecated, is using email.message.Message. See Philip Couling's answer. I agree with Philip that it's kind of gross.

Upvotes: 20

user2233709

Reputation: 203

Since requests 2.19.0, there is a requests.utils._parse_content_type_header function that splits a Content-Type header into a parameter-less content-type and a dictionary of parameters. This function does not split the content-type into main type and sub-type.

>>> requests.utils._parse_content_type_header("text/html; charset=UTF-8")
('text/html', {'charset': 'UTF-8'})

Note that the name of this function starts with an underscore: it’s supposed to be a private function, so I guess it might be dropped in a future release. For the record, a request to make it a public interface was rejected: https://github.com/psf/requests/issues/6362
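Because the function is private, one cautious approach is to guard the import and fall back to the standard library if it ever disappears. This is only a sketch under that assumption: the wrapper name parse_content_type is invented here, and the fallback is similar to, but not guaranteed identical with, what requests does (for example in how parameter names are cased and how quotes are handled).

```python
from email.message import Message

try:
    # Private API: may be renamed or removed in a future requests release.
    from requests.utils import _parse_content_type_header as _requests_parse
except ImportError:
    _requests_parse = None

def parse_content_type(value: str) -> tuple[str, dict[str, str]]:
    """Return (content_type, params), preferring the requests helper if present."""
    if _requests_parse is not None:
        return _requests_parse(value)
    # Fallback: let the email package parse the header value.
    msg = Message()
    msg["content-type"] = value
    params = msg.get_params()
    return params[0][0], dict(params[1:])

print(parse_content_type("text/html; charset=UTF-8"))
```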

Upvotes: 3

lipponen

Reputation: 753

Your question is a bit unclear. I assume that you are using some sort of web application framework such as Django or Flask?

Here is an example of how to read Content-Type using Flask:

from flask import Flask, request
app = Flask(__name__)

@app.route("/")
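def test():
  # A Flask view must return a response, so return the header value (empty string if absent)
  return request.headers.get('Content-Type', '')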


if __name__ == "__main__":
  app.run()

Upvotes: -1
