PARTEEK KANSAL

Reputation: 43

regex pattern not matching continuous groups

I am trying the following pattern:

[,;\" ](.+?\/.+?)[\",; ]

in the following string:

['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']

It matches text/html and application/xml but not application/xhtml+xml. Why?

I want to extract text/html, application/xhtml+xml and application/xml. It extracts the first and the third, but not the middle one.


Answers (1)

Wiktor Stribiżew

Reputation: 627100

Your trailing [,"; ] consumes the , after text/html, so at the next iteration, when the regex engine searches for a match, the leading [,;" ] can no longer match that comma (matches cannot overlap). Hence, you lose one match.
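To see the consumed delimiter, here is a small sketch that prints the full text of each match next to its Group 1 value. The pattern and string are copied from the question; the variable names are only for illustration.

```python
import re

# Pattern and input copied from the question.
pattern = r'[,;\" ](.+?\/.+?)[\",; ]'
s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""

matches = list(re.finditer(pattern, s))
for m in matches:
    # group(0) is everything the match consumed, delimiters included;
    # group(1) is the captured media type.
    print(repr(m.group(0)), '->', m.group(1))

found = [m.group(1) for m in matches]
print(found)
```

The first match consumes the comma that follows text/html, so the next search starts at the a of application/xhtml+xml and the first delimiter it can find is the comma before application/xml.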

You may turn the trailing [,"; ] into a non-consuming pattern, a positive lookahead, or better, since the matches cannot contain the delimiters, use a negated character class approach:

[,;" ]([^/,;" ]+/[^/,;" ]+)

See the regex demo. If the expected matches can contain more than one /, remove the / char from the second negated character class.
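As a rough illustration of the multi-slash case, the sketch below compares the pattern above with a variant whose second class no longer excludes /. The sample string is made up for this example (it is not from the question), and the names rx_single and rx_multi are hypothetical.

```python
import re

# Pattern from this answer: no "/" allowed on either side of the single slash.
rx_single = r'[,;" ]([^/,;" ]+/[^/,;" ]+)'
# Variant: "/" removed from the second class, so extra slashes are allowed.
rx_multi = r'[,;" ]([^/,;" ]+/[^,;" ]+)'

# Hypothetical input with a value that contains two slashes.
sample = '"text/html,a/b/c;q=0.9"'

single = re.findall(rx_single, sample)
multi = re.findall(rx_multi, sample)
print(single)  # => ['text/html', 'a/b']
print(multi)   # => ['text/html', 'a/b/c']
```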

Details

  • [,;" ] - a comma, ;, " or space
  • ([^/,;" ]+/[^/,;" ]+) - Group 1: one or more chars other than /, ,, ;, " and space (as many as possible), then a /, and then again one or more chars other than /, ,, ;, " and space (as many as possible)

Python demo:

import re
rx = r'[,;" ]([^/,;" ]+/[^/,;" ]+)'
s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""
res = re.findall(rx, s)
print(res) # => ['text/html', 'application/xhtml+xml', 'application/xml']
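For comparison, the positive lookahead alternative mentioned above could look like the sketch below. It keeps the question's lazy .+? parts and only wraps the trailing delimiter class in (?=...), so the delimiter is checked but not consumed. The name rx_lookahead is just for illustration.

```python
import re

# The trailing delimiter is asserted with a lookahead instead of being consumed,
# so it is still available as the leading delimiter of the next match.
rx_lookahead = r'[,;" ](.+?/.+?)(?=[",; ])'
s = """['"text/html,application/xhtml+xml,application/xml;q=0.9;q =0.8"']"""

res_lookahead = re.findall(rx_lookahead, s)
print(res_lookahead)  # => ['text/html', 'application/xhtml+xml', 'application/xml']
```

The negated character class version is still preferable here, since it needs no lazy matching and cannot run across a delimiter.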
