Reputation: 12722
Here's a Python regex that I built by transcribing RFC 7230's definition of "header-field", i.e., it's supposed to match things like "Connection: close":
rb"^(?P<field_name>[-!#$%&%&'*+.^_`|~0-9a-zA-Z]+):[ \t]*(?P<field_value>([\x21-\xff]([ \t]+[\x21-\xff])?)*)[ \t]*$"
(Looking at RFC 7230 will definitely help make sense of it.)
For some reason I can't fathom, though, it seems to work unless the field value contains a single non-whitespace character that has whitespace on both sides:
In [36]: r = re.compile(rb"^(?P<field_name>[-!#$%&%&'*+.^_`|~0-9a-zA-Z]+):[ \t]*(?P<field_value>([\x21-\xff]([ \t]+[\x21-\xff])?)*)[ \t]*$")
In [38]: r.match(b"Foo: bar")
Out[38]: <_sre.SRE_Match object; span=(0, 8), match=b'Foo: bar'>
In [39]: r.match(b"Foo: bar baz quux")
Out[39]: <_sre.SRE_Match object; span=(0, 17), match=b'Foo: bar baz quux'>
In [40]: r.match(b"Foo: bar baz a quux")
In [41]: r.match(b"Foo: bar baz quux a")
Out[41]: <_sre.SRE_Match object; span=(0, 19), match=b'Foo: bar baz quux a'>
Why does the 3rd example fail to match, while all the others succeed?
Upvotes: 3
Views: 323
Reputation: 6098
Whenever your pattern crosses the whitespace after a word, it consumes the first character of the next word in the same step.
Your regex for the field value matches 0 or more copies of this regex:
[\x21-\xff]([ \t]+[\x21-\xff])?
At a word boundary (the part captured by [ \t]+), the optional group goes on to capture the first letter of the next word. If that’s a single-letter word, the entire word is consumed and the next character is whitespace, but the pattern expects [\x21-\xff] there. So it fails to match.
If a single-char word is at the beginning of the value, there’s no preceding word to capture it, so it’s fine. If it’s at the end of the value, it doesn’t matter that we already captured this letter, because only optional trailing whitespace and the end of the string remain.
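To see where the engine gives up, you can run the field-value sub-pattern on its own, without the trailing anchor. This is only a small sketch; the names VALUE, consumed and rest are mine, not from the question.

```python
import re

# Field-value sub-pattern from the question, without the surrounding anchors.
VALUE = re.compile(rb"([\x21-\xff]([ \t]+[\x21-\xff])?)*")

data = b"bar baz a quux"
m = VALUE.match(data)

consumed = m.group(0)   # what the repeated group managed to match
rest = data[m.end():]   # what is left for the tail of the full regex

print(consumed)  # b'bar baz a'
print(rest)      # b' quux'
```

The leftover starts with a space that no iteration can pick up, and in the full regex the trailing [ \t]*$ cannot get past "quux" either, so the overall match fails.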
I would suggest simplifying the field value group to
(?P<field_value>(?:[\x21-\xff]*[ \t]?)*)
That captures arbitrary runs of \x21-\xff characters, each optionally followed by a single space or tab. (I’ve also used a non-capturing group for tidiness.)
This regex passes all of your original test cases. I haven’t given it much testing, but I think it resolves this particular issue.
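As a quick check, here is a sketch that drops the suggested group into the full pattern from the question and runs it over the four inputs from the IPython session. The names HEADER, samples, results and trailing are just for illustration.

```python
import re

# Full header pattern from the question, with the field-value group swapped
# for the simplified version suggested above.
HEADER = re.compile(
    rb"^(?P<field_name>[-!#$%&%&'*+.^_`|~0-9a-zA-Z]+):[ \t]*"
    rb"(?P<field_value>(?:[\x21-\xff]*[ \t]?)*)[ \t]*$"
)

samples = [
    b"Foo: bar",
    b"Foo: bar baz quux",
    b"Foo: bar baz a quux",
    b"Foo: bar baz quux a",
]

# True where the pattern matches the whole line.
results = [HEADER.match(s) is not None for s in samples]
print(results)  # [True, True, True, True]

# The group is greedy, so trailing whitespace ends up inside field_value.
trailing = HEADER.match(b"Foo: bar  ").group("field_value")
print(trailing)  # b'bar  '
```

One thing to be aware of: because this group also accepts whitespace, trailing spaces land in field_value rather than being left to the final [ \t]*, so you may want to strip the captured value afterwards.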
Upvotes: 0
Reputation: 785128
You should be using this regex:
^(?P<field_name>[-!#$%&%&'*+.^_`|~0-9a-zA-Z]+):[ \t]*(?P<field_value>[\x21-\xff]+(?:[ \t]+[\x21-\xff]+)*)[ \t]*$
This part of your regex is faulty:
([\x21-\xff]([ \t]+[\x21-\xff])?)*
After a single-character word that is preceded by whitespace, it cannot match anything further except optional whitespace before the end of the line.
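A short sketch for trying the suggested regex against the inputs from the question. The names HEADER, samples, values, trimmed and empty are hypothetical, and the last two checks are extra cases I added, not ones from the question.

```python
import re

# The regex suggested in this answer, compiled as a bytes pattern.
HEADER = re.compile(
    rb"^(?P<field_name>[-!#$%&%&'*+.^_`|~0-9a-zA-Z]+):[ \t]*"
    rb"(?P<field_value>[\x21-\xff]+(?:[ \t]+[\x21-\xff]+)*)[ \t]*$"
)

samples = [
    b"Foo: bar",
    b"Foo: bar baz quux",
    b"Foo: bar baz a quux",
    b"Foo: bar baz quux a",
]

# Captured field values for the four inputs from the question.
values = [HEADER.match(s).group("field_value") for s in samples]
print(values)

# Trailing whitespace is left outside the captured value.
trimmed = HEADER.match(b"Foo: bar  ").group("field_value")
print(trimmed)  # b'bar'

# The leading [\x21-\xff]+ needs at least one character, so an empty value is rejected.
empty = HEADER.match(b"Foo:")
print(empty)  # None
```

Note the last case: RFC 7230 allows an empty field-value, so if that matters for you, the value group would need to be made optional.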
Upvotes: 1