Nathaniel J. Smith

Reputation: 12722

Why does this regex for HTTP headers fail to match whenever the value contains a single-character token?

Here's a Python regex that I built by transcribing RFC 7230's definition of "header-field", i.e., it's supposed to match things like Connection: close:

rb"^(?P<field_name>[-!#$%&%&'*+.^_`|~0-9a-zA-Z]+):[ \t]*(?P<field_value>([\x21-\xff]([ \t]+[\x21-\xff])?)*)[ \t]*$"

(Looking at RFC 7230 will definitely help make sense of it.)

For some reason I can't fathom, though, it seems to work unless the field value contains a single non-whitespace character that has whitespace on both sides:

In [36]: r = re.compile(rb"^(?P<field_name>[-!#$%&%&'*+.^_`|~0-9a-zA-Z]+):[ \t]*(?P<field_value>([\x21-\xff]([ \t]+[\x21-\xff])?)*)[ \t]*$")

In [38]: r.match(b"Foo: bar")
Out[38]: <_sre.SRE_Match object; span=(0, 8), match=b'Foo: bar'>

In [39]: r.match(b"Foo: bar baz quux")
Out[39]: <_sre.SRE_Match object; span=(0, 17), match=b'Foo: bar baz quux'>

In [40]: r.match(b"Foo: bar baz a quux")

In [41]: r.match(b"Foo: bar baz quux a")
Out[41]: <_sre.SRE_Match object; span=(0, 19), match=b'Foo: bar baz quux a'>

Why does the 3rd example fail to match, while all the others succeed?

Upvotes: 3

Views: 323

Answers (2)

alexwlchan

Reputation: 6098

Each repetition of your inner group matches one character and, whenever whitespace follows, the first character of the next word along with it.

Your regex for the field value matches 0 or more copies of this regex:

[\x21-\xff]([ \t]+[\x21-\xff])?

At a word boundary (the part captured by [ \t]+), the optional group goes on to capture the first letter of the next word. If that’s a single-letter word, the entire word is consumed and the next character is whitespace, but a new repetition has to start with [\x21-\xff]. So it fails to match.

If a single-character word is at the beginning of the value, there’s no preceding word to capture it, so it’s fine. If it’s at the end of the value, it doesn’t matter that the previous repetition already consumed it, because nothing is left to match and we skip straight to the end.
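To see this in isolation, here is a minimal sketch that applies only the repeated unit to a few bare values. The names unit, value, samples and results are mine, not from the question, and fullmatch stands in for the ^...$ anchors of the full pattern:

```python
import re

# The repeated unit from the question: one visible byte, optionally
# followed by whitespace and exactly one more visible byte.
unit = rb"[\x21-\xff]([ \t]+[\x21-\xff])?"
value = re.compile(rb"(?:" + unit + rb")*")

# A single-character word at the start, at the end, and in the middle.
samples = [b"a quux", b"quux a", b"bar a quux", b"a b c"]
results = {s: value.fullmatch(s) is not None for s in samples}

for s, ok in results.items():
    print(s, ok)
```

The first two values match and the last two do not: once the optional group has swallowed the lone "a" or "b", the whitespace after it has nothing to attach to.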

I would suggest simplifying the field value group to

(?P<field_value>(?:[\x21-\xff]*[ \t]?)*)

That captures arbitrary runs of \x21-\xff characters, each optionally followed by a single space or tab. (I’ve also made the inner group non-capturing for tidiness.)

This regex passes all of your original test cases. I haven’t given it much testing, but I think it resolves this particular issue.
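For anyone who wants to repeat that check, here is a rough sketch that drops the suggested group into the pattern from the question and runs the four inputs through it. The variable names (pattern, cases, values, trailing) are my own, and the field_name class is copied verbatim from the question:

```python
import re

# The question's pattern with only the field_value group replaced
# by the suggestion above.
pattern = re.compile(
    rb"^(?P<field_name>[-!#$%&%&'*+.^_`|~0-9a-zA-Z]+):[ \t]*"
    rb"(?P<field_value>(?:[\x21-\xff]*[ \t]?)*)[ \t]*$"
)

cases = [
    b"Foo: bar",
    b"Foo: bar baz quux",
    b"Foo: bar baz a quux",
    b"Foo: bar baz quux a",
]
# Every case is expected to match, so .group() is called directly.
values = {c: pattern.match(c).group("field_value") for c in cases}

# The group is greedy and also accepts whitespace, so trailing
# whitespace lands inside field_value rather than in the final [ \t]*.
trailing = pattern.match(b"Foo: bar  ").group("field_value")

for c, v in values.items():
    print(c, v)
print(trailing)
```

All four inputs match. One thing this shows is that trailing whitespace ends up inside field_value, so you may want to strip the captured value afterwards.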

Upvotes: 0

anubhava

Reputation: 785128

You should be using this regex:

^(?P<field_name>[-!#$%&%&'*+.^_`|~0-9a-zA-Z]+):[ \t]*(?P<field_value>[\x21-\xff]+(?:[ \t]+[\x21-\xff]+)*)[ \t]*$


This part of your regex is faulty:

([\x21-\xff]([ \t]+[\x21-\xff])?)*

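Each repetition can consume whitespace only as part of the optional group, which then takes exactly one more character. So after a single-character token in the middle of the value, the whitespace that follows has nothing to attach to and the match fails. (At the end of the line this does not matter, because nothing is left to match.)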
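A short sketch for checking the regex above against the four inputs from the question. The variable names (pattern, cases, values, empty_value_match) are illustrative, and the pattern is only split across two lines for readability:

```python
import re

# The suggested regex, as a bytes pattern like the one in the question.
pattern = re.compile(
    rb"^(?P<field_name>[-!#$%&%&'*+.^_`|~0-9a-zA-Z]+):[ \t]*"
    rb"(?P<field_value>[\x21-\xff]+(?:[ \t]+[\x21-\xff]+)*)[ \t]*$"
)

cases = [
    b"Foo: bar",
    b"Foo: bar baz quux",
    b"Foo: bar baz a quux",
    b"Foo: bar baz quux a",
]
# Every case is expected to match, so .group() is called directly.
values = {c: pattern.match(c).group("field_value") for c in cases}

# The leading [\x21-\xff]+ requires at least one character, so a
# header with an empty value does not match this version.
empty_value_match = pattern.match(b"Foo:")

for c, v in values.items():
    print(c, v)
print(empty_value_match)
```

All four inputs match, including the third one. Note that this version requires a non-empty value, so b"Foo:" is rejected, whereas the original pattern accepted it; wrap the value group's contents in an optional group if empty values should be allowed.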

Upvotes: 1
