How should I decode HTTP headers from bytes to string?

Question

Basically, I'm trying to make a small web server in Python from scratch (just to learn), and I'm having issues decoding the headers. The code boils down to this (I've simplified it to only the code related to the issue):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(('', 80))
sock.listen(1)

while True:
    conn, addr = sock.accept()

    print(addr[0])
    request = conn.recv(2048).decode('utf-8')

    headers = (
        'HTTP/1.0 200 OK',
        'Content-Type: text/html'
    )

    content = 'success'

    response = "\r\n".join(headers) + "\r\n\r\n" + content

    conn.sendall(bytes(response, 'UTF-8'))

    conn.close()

I installed the addon HttpRequester for Firefox to fiddle around with what I have currently and tried attaching a file, which led to the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 386: invalid start byte

How do I go about fixing this? Should I wrap the thing in try: and ignore requests which lead to exceptions of that kind?

Martijn Pieters · Accepted Answer

RFC 7230 has this to say about field parsing:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

where RFC 2047 gives you an extension mechanism to use other character sets; these would be encoded to ASCII anyway and require an additional step to decode. Personally, I've never seen such headers actually being used in HTTP communication.
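If such an RFC 2047 encoded-word ever did show up, that additional decoding step could be sketched with the standard library's email.header module. This is only an illustration: decode_rfc2047 is a hypothetical helper name, and reusing the email package for HTTP header values is an assumption of this sketch, not something RFC 7230 prescribes.

```python
from email.header import decode_header, make_header


def decode_rfc2047(value):
    """Decode RFC 2047 encoded-words in an already Latin-1-decoded value.

    Hypothetical helper for illustration; values without encoded-words
    are returned unchanged.
    """
    return str(make_header(decode_header(value)))


# An encoded-word is plain ASCII on the wire and names its own charset.
print(decode_rfc2047('=?iso-8859-1?q?p=F6stal?='))
```

Ordinary values such as 'text/html' contain no encoded-words and pass through untouched, so the helper could be applied only to the few fields where such encoding is expected.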

You can thus safely assume all headers can be decoded as Latin-1, and RFC 2047 headers can be dealt with separately later:

request = conn.recv(2048)
headers, sep, body = request.partition(b'\r\n\r\n')
headers = headers.decode('latin1')

This does assume that all headers fit in those 2048 bytes.
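If that assumption is a concern, one option is to keep calling recv() until the blank line that ends the header section has arrived. The following is a minimal sketch, not a hardened implementation: read_head, HEADER_END and the 64 KiB cap in MAX_HEADER_BYTES are names and values made up for this example.

```python
import socket

HEADER_END = b'\r\n\r\n'
MAX_HEADER_BYTES = 65536  # arbitrary cap, an assumption of this sketch


def read_head(conn, bufsize=2048, limit=MAX_HEADER_BYTES):
    """Read from conn until the blank line that ends the headers is seen.

    Returns (header_text, leftover_bytes); leftover_bytes is whatever part
    of the body happened to arrive together with the headers.
    """
    data = b''
    while HEADER_END not in data:
        if len(data) > limit:
            raise ValueError('header section too large')
        chunk = conn.recv(bufsize)
        if not chunk:  # peer closed the connection before finishing
            break
        data += chunk
    head, _sep, rest = data.partition(HEADER_END)
    # Only the header section is decoded; the body stays as bytes.
    return head.decode('latin-1'), rest
```

In the server loop this would replace the single conn.recv(2048) call, for example headers, body_start = read_head(conn). The body is left undecoded because an attached file can contain arbitrary bytes.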

Latin-1 (ISO-8859-1) maps every byte one-to-one to a Unicode code point, so decoding with it never fails. Even fields that should be treated as opaque data can be decoded this way, although Latin-1 is probably the wrong codec for those headers. In practice, you probably won't come across such headers, and even if you did, you would not care about them. The headers that matter are all ASCII-encoded.
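A quick way to convince yourself of the one-to-one mapping is to decode every possible byte value and compare the result with UTF-8. This is a small self-check, with variable names chosen only for the example:

```python
# Every possible byte value, including the 0x90 from the traceback.
raw = bytes(range(256))

text = raw.decode('latin-1')  # never raises
# Each byte becomes the code point with the same number.
assert [ord(ch) for ch in text] == list(range(256))
# Encoding back yields the original bytes, so nothing is lost.
assert text.encode('latin-1') == raw

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    utf8_error = exc  # UTF-8 rejects bytes that do not form valid sequences
else:
    utf8_error = None
```

Because the round trip is lossless, a header value that turns out to use another encoding can still be recovered later with value.encode('latin-1') followed by a decode with the correct codec.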
