Rajat

Reputation: 5803

Regex Capturing Group

Say I have this dummy URL and I need to extract the plants and their colors as capture groups:

https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228_plant3.red.403010_plant4.orange.399987.html

The regex below captures the elements I need as intended, but it fails to capture anything when the URL contains fewer than 4 plants. There is a link to a regex tester at the bottom, with sample code and a URL that you can play with.

How do I modify this regex so that it works dynamically and captures whatever is available, without requiring a fixed URL structure? For now, assume I am capturing at most 4 plants (8 groups).

(flowers\.com)\/compare\._(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+))
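The failure described above can be reproduced with a short Python sketch. The two-plant URL and the variable names here are made up for illustration; only the pattern and the four-plant URL come from the question. The pattern requires four underscores in total, so a URL with fewer plant segments cannot match at all.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(
    r"(flowers\.com)\/compare\._(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+))"
    r".*_(?:([^.]+)\.([^.]+)).*_(?:([^.]+)\.([^.]+))"
)

four_plants = (
    "https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228"
    "_plant3.red.403010_plant4.orange.399987.html"
)
# Hypothetical shorter URL, assumed to follow the same structure.
two_plants = "https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228.html"

print(ORIGINAL.search(four_plants) is not None)  # True: all four segments are present
print(ORIGINAL.search(two_plants))               # None: the third and fourth "_" are missing
```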

https://regex101.com/r/prjAO7/2

Upvotes: 2

Views: 77

Answers (2)

The fourth bird

Reputation: 163642

You could match the first plant as required and make the second, third, and fourth optional by wrapping each one in a non-capturing group followed by a question mark: (?:...)?

Instead of .*, you could also match a dot followed by one or more digits with \.\d+, which avoids unnecessary backtracking.

(flowers\.com)\/compare\._([^.]+)\.([^.]+)\.\d+(?:_([^.]+)\.([^.]+)\.\d+)?(?:_([^.]+)\.([^.]+)\.\d+)?(?:_([^.]+)\.([^.]+)\.\d+)?
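As a minimal sketch of how this pattern behaves in Python: optional groups that do not take part in the match come back as None. The two-plant URL and the helper name capture_groups are assumptions made for illustration.

```python
import re

# The pattern above, split across lines for readability only.
PATTERN = re.compile(
    r"(flowers\.com)\/compare\._([^.]+)\.([^.]+)\.\d+"
    r"(?:_([^.]+)\.([^.]+)\.\d+)?"
    r"(?:_([^.]+)\.([^.]+)\.\d+)?"
    r"(?:_([^.]+)\.([^.]+)\.\d+)?"
)

def capture_groups(url):
    """Return all nine groups (host plus 4 plant/color pairs), or None if no match."""
    m = PATTERN.search(url)
    return m.groups() if m else None

# Hypothetical URL with only two plants.
two_plants = "https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228.html"
print(capture_groups(two_plants))
# ('flowers.com', 'plant1', 'green', 'plant2', 'yellow', None, None, None, None)
```

Callers that only want the plants that are present can filter out the None entries.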

Another option, if you already know the URL is a flowers.com URL, is to parse it and work on the path only. If every plant segment has the same structure, a single copy of the repeating part of the pattern is enough: _([^.]+)\.([^.]+)\.\d+

For example

from urllib.parse import urlparse
import re

pattern = r"_([^.]+)\.([^.]+)\.\d+"

o = urlparse('https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228_plant3.red.403010_plant4.orange.399987.html')
print(re.findall(pattern, o.path))

Output

[('plant1', 'green'), ('plant2', 'yellow'), ('plant3', 'red'), ('plant4', 'orange')]
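Because re.findall returns one tuple per segment, this approach is not tied to four plants. A small sketch, assuming a hypothetical helper named extract_plants and made-up shorter URLs; the hostname check and the ValueError are illustrative choices, not part of the answer above:

```python
import re
from urllib.parse import urlparse

# One plant segment: "_", name, ".", color, ".", numeric id.
PLANT = re.compile(r"_([^.]+)\.([^.]+)\.\d+")

def extract_plants(url):
    """Return a list of (plant, color) tuples found in the URL path."""
    parsed = urlparse(url)
    if parsed.hostname != "flowers.com":
        raise ValueError(f"not a flowers.com URL: {url}")
    return PLANT.findall(parsed.path)

print(extract_plants("https://flowers.com/compare._plant1.green.402992.html"))
# [('plant1', 'green')]
print(extract_plants(
    "https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228"
    "_plant3.red.403010.html"
))
# [('plant1', 'green'), ('plant2', 'yellow'), ('plant3', 'red')]
```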

Upvotes: 2

RomanPerekhrest

Reputation: 92904

For any number of plants:

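import re

url = 'https://flowers.com/compare._plant1.green.402992_plant2.yellow.402228_plant3.red.403010_plant4.orange.399987.html'
# Strip everything up to and including "flowers.com/compare." so only the plant segments remain.
tail = re.sub(r'.+/flowers\.com/compare\.', '', url)
# Each segment: optional digits left over from the previous id, "_", the plant name, ".", a lowercase color.
matches = re.finditer(r'\d*_([^.]+)\.([a-z]+)\.?', tail)
for m in matches:
    print(m.group(1), m.group(2))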

Sample output:

plant1 green
plant2 yellow
plant3 red
plant4 orange

Upvotes: 2
