Regex Capturing Group

Question

Say I have this dummy URL and I need to extract the plants and their colors as capture groups:

https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228_plant3.red.403010_plant4.orange.399987.html

The regex below captures the elements I need as intended, but fails to capture anything when there are fewer than 4 plants in the URL. There is a link to a regex tester at the bottom with sample code and a URL that you can play with.

How do I modify this regex to work dynamically, so that it captures what's available without requiring a static URL structure? For now, assume I am capturing at most 4 plants (8 groups).

(flowers\.com)\/compare\._(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+))

https://regex101.com/r/prjAO7/2

The fourth bird · Accepted Answer

You could match the first plant and make the second, third, and fourth optional by wrapping each one in a non-capturing group followed by a question mark: (?:...)?

Instead of using .* you could match a dot and 1+ digits using \.\d+ to prevent unnecessary backtracking.

(flowers\.com)\/compare\._([^.]+)\.([^.]+)\.\d+(?:_([^.]+)\.([^.]+)\.\d+)?(?:_([^.]+)\.([^.]+)\.\d+)?(?:_([^.]+)\.([^.]+)\.\d+)?
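As a rough check of how the optional groups behave, here is a minimal Python sketch that runs the pattern above against a shortened URL. The two-plant URL is a hypothetical example, not one from the question; groups belonging to plants that are absent come back as None.

```python
import re

# Pattern from the answer: the first plant is required, plants 2-4 are optional.
PATTERN = re.compile(
    r"(flowers\.com)\/compare\._([^.]+)\.([^.]+)\.\d+"
    r"(?:_([^.]+)\.([^.]+)\.\d+)?"
    r"(?:_([^.]+)\.([^.]+)\.\d+)?"
    r"(?:_([^.]+)\.([^.]+)\.\d+)?"
)

# Hypothetical URL with only two plants (an assumption for illustration).
url = "https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228.html"

match = PATTERN.search(url)
groups = match.groups() if match else ()
print(groups)
# ('flowers.com', 'plant1', 'green', 'plant2', 'yellow', None, None, None, None)
```

Group 1 is the domain, so the plant and color pairs sit in groups 2 through 9.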

Another option, if you already know it is a flowers.com URL, is to parse the URL and get the path. If the parts for each plant are structured in the same way, you can use a single part of the pattern, _([^.]+)\.([^.]+)\.\d+, and collect every match with re.findall.

For example

from urllib.parse import urlparse
import re

# One plant segment: _<name>.<color>.<digits>
pattern = r"_([^.]+)\.([^.]+)\.\d+"

# Parse the URL and search only its path component
o = urlparse('https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228_plant3.red.403010_plant4.orange.399987.html')
print(re.findall(pattern, o.path))

Output

[('plant1', 'green'), ('plant2', 'yellow'), ('plant3', 'red'), ('plant4', 'orange')]
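Since the question caps the result at 4 plants, the findall approach could be wrapped in a small helper that slices the result. This is only a sketch: the name extract_plants, its limit parameter, and the two-plant URL are hypothetical and not part of the original answer.

```python
import re
from urllib.parse import urlparse

# Same single-segment pattern as above: _<name>.<color>.<digits>
PLANT = re.compile(r"_([^.]+)\.([^.]+)\.\d+")


def extract_plants(url, limit=4):
    """Return up to `limit` (plant, color) pairs found in the URL path.

    Hypothetical helper for illustration only.
    """
    path = urlparse(url).path
    return PLANT.findall(path)[:limit]


# Hypothetical URL with only two plants (an assumption for illustration).
pairs = extract_plants(
    "https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228.html"
)
print(pairs)        # [('plant1', 'green'), ('plant2', 'yellow')]
print(dict(pairs))  # {'plant1': 'green', 'plant2': 'yellow'}
```

Unlike the single regex with optional groups, this returns only the pairs that are present, so there are no None placeholders to filter out.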
