regex pattern not matching continuous groups

Question

I am trying the following pattern:

[,;\" ](.+?\/.+?)[\",; ]

in the following string:

['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']

It matches text/html and application/xml but not application/xhtml+xml. Why?

I want to extract text/html, application/xhtml+xml and application/xml. It extracts the first and third, but not the middle one.

Wiktor Stribiżew · Accepted Answer

Your last [,"; ] consumes the , after text/html and thus, at the next iteration, when the regex engine searches for a match, the first [,;" ] cannot match that comma. Hence, you lose one match.
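To see the consumed delimiter in action, here is a minimal sketch that runs the pattern from the question through re.finditer and prints what each full match swallowed. The variable names are illustrative only, and the backslash escapes from the question are dropped because they are not needed in a Python raw string.

```python
import re

s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""

# Pattern from the question: the trailing class consumes one delimiter.
original = r'[,;" ](.+?/.+?)[",; ]'

matches = list(re.finditer(original, s))
for m in matches:
    # group(0) is the whole match, including both delimiters
    print(repr(m.group(0)), '->', m.group(1))

found = [m.group(1) for m in matches]
print(found)  # => ['text/html', 'application/xml']
```

The first full match ends with the comma after text/html, so the search for the next match starts at the a of application, and the leading class has to skip ahead to the comma after application/xhtml+xml.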

You may turn the trailing [,"; ] into a non-consuming pattern, a positive lookahead, or better, since the matches cannot contain the delimiters, use a negated character class approach:

[,;" ]([^/,;" ]+/[^/,;" ]+)

See the regex demo. If there can be more than 1 / inside the expected matches, remove / char from the second character class.
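As a rough illustration of that last remark, the sketch below compares the pattern above with a variant whose second character class no longer excludes /. The input string is made up for the example and does not come from the question.

```python
import re

# Hypothetical input with more than one / inside a single item
sample = '"image/svg/extra,text/plain"'

strict = r'[,;" ]([^/,;" ]+/[^/,;" ]+)'   # second class excludes /
relaxed = r'[,;" ]([^/,;" ]+/[^,;" ]+)'   # second class allows /

strict_res = re.findall(strict, sample)
relaxed_res = re.findall(relaxed, sample)

print(strict_res)   # => ['image/svg', 'text/plain']
print(relaxed_res)  # => ['image/svg/extra', 'text/plain']
```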

Details

[,;" ] - a comma, ;, ", or space
([^/,;" ]+/[^/,;" ]+) - Group 1: one or more chars other than /, ,, ;, " and space, as many as possible, then a /, and then again one or more chars other than /, ,, ;, " and space, as many as possible

Python demo:

import re
rx = r'[,;" ]([^/,;" ]+/[^/,;" ]+)'
s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""
res = re.findall(rx, s)
print(res) # => ['text/html', 'application/xhtml+xml', 'application/xml']
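For comparison, the positive lookahead alternative mentioned earlier could look like the sketch below. It keeps the lazy .+? parts of the question's pattern and only makes the trailing delimiter non-consuming. The second input string is a made-up example that shows why the negated character class version is the safer choice.

```python
import re

s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""

# The trailing delimiter is only checked, not consumed
lookahead_rx = r'[,;" ](.+?/.+?)(?=[",; ])'
negated_rx = r'[,;" ]([^/,;" ]+/[^/,;" ]+)'

lookahead_res = re.findall(lookahead_rx, s)
print(lookahead_res)  # => ['text/html', 'application/xhtml+xml', 'application/xml']

# Hypothetical input where an item without / precedes one with /
tricky = ';q=0.9,text/html,'
tricky_lookahead = re.findall(lookahead_rx, tricky)
tricky_negated = re.findall(negated_rx, tricky)
print(tricky_lookahead)  # => ['q=0.9,text/html']
print(tricky_negated)    # => ['text/html']
```

With the lookahead, the first .+? may still run across a delimiter while looking for a /, which the negated character classes rule out.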
