Why does this regex for HTTP headers fail to match whenever the value contains a single-character token?

Question

Here's a Python regex that I built by transcribing RFC 7230's definition of "header-field", i.e., it's supposed to match things like Connection: close:

rb"^(?P[-!#$%&%&'*+.^_`|~0-9a-zA-Z]+):[ 	]*(?P([\x21-\xff]([ 	]+[\x21-\xff])?)*)[ 	]*$"

(Looking at RFC 7230 will definitely help make sense of it.)

For some reason I can't fathom, though, it seems to work unless the field value contains a single non-whitespace character that has whitespace on both sides:

In [36]: r = re.compile(rb"^(?P[-!#$%&%&'*+.^_`|~0-9a-zA-Z]+):[ 	]*(?P([\x21-\xff]([ 	]+[\x21-\xff])?)*)[ 	]*$")

In [38]: r.match(b"Foo: bar")
Out[38]: <_sre.SRE_Match object; span=(0, 8), match=b'Foo: bar'>

In [39]: r.match(b"Foo: bar baz quux")
Out[39]: <_sre.SRE_Match object; span=(0, 17), match=b'Foo: bar baz quux'>

In [40]: r.match(b"Foo: bar baz a quux")

In [41]: r.match(b"Foo: bar baz quux a")
Out[41]: <_sre.SRE_Match object; span=(0, 19), match=b'Foo: bar baz quux a'>

Why does the 3rd example fail to match, while all the others succeed?

anubhava · Accepted Answer

You should be using this regex:

^(?P[-!#$%&%&'*+.^_`|~0-9a-zA-Z]+):[ 	]*(?P[\x21-\xff]+(?:[ 	]+[\x21-\xff]+)*)[ 	]*$
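As a minimal sketch of how that pattern could be used from Python: the group names `name` and `value` below are assumptions made for illustration (the names in the original post did not survive), and `HEADER_RE` is a hypothetical variable name.

```python
import re

# Group names "name" and "value" are assumed for illustration only.
HEADER_RE = re.compile(
    rb"^(?P<name>[-!#$%&'*+.^_`|~0-9a-zA-Z]+):[ \t]*"
    rb"(?P<value>[\x21-\xff]+(?:[ \t]+[\x21-\xff]+)*)[ \t]*$"
)

for line in (b"Foo: bar", b"Foo: bar baz a quux", b"Foo: bar baz quux a  "):
    m = HEADER_RE.match(line)
    # Each of these lines matches; trailing whitespace is left out of "value".
    print(line, "->", m.group("name"), m.group("value"))
```

Note that, as written, this pattern requires at least one non-whitespace character in the value, so a header with an empty value such as `b"Foo:"` does not match.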


This part of your regex is faulty:

([\x21-\xff]([ 	]+[\x21-\xff])?)*

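Each iteration of the outer group consumes one non-whitespace character, optionally followed by a run of whitespace and exactly one more non-whitespace character. Whitespace can therefore only appear in the middle of an iteration, never at its start. In bar baz a quux, the lone a has to be consumed as the tail of the iteration that starts at z, so the following " quux" would need to begin a new iteration with whitespace, which the pattern does not allow. A single character at the very end of the value (as in your 4th example) still works, because nothing has to follow it.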
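To see the difference in isolation, here is a small sketch that applies only the value sub-patterns to the values from the question. `FAULTY`, `FIXED`, and `results` are hypothetical names chosen for this example.

```python
import re

# The value sub-pattern from the question, and the replacement suggested above.
FAULTY = re.compile(rb"([\x21-\xff]([ \t]+[\x21-\xff])?)*")
FIXED = re.compile(rb"[\x21-\xff]+(?:[ \t]+[\x21-\xff]+)*")

values = (b"bar", b"bar baz quux", b"bar baz a quux", b"bar baz quux a")

# fullmatch() requires the whole value to be consumed, which mirrors the
# effect of the ^ ... $ anchors in the complete header regex.
results = {
    v: (bool(FAULTY.fullmatch(v)), bool(FIXED.fullmatch(v))) for v in values
}

for v, (faulty_ok, fixed_ok) in results.items():
    print(v, "faulty:", faulty_ok, "fixed:", fixed_ok)
```

Only `b"bar baz a quux"` is rejected by the faulty sub-pattern; the fixed one accepts all four values.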
