Viren Mody

Reputation: 11

Python Regex: How to extract string between parentheses and quotes if they exist

I am trying to extract the argument of each trigger in a Jenkinsfile: the text between the parentheses, without the surrounding quotes if there are any.

For example, given the following:

upstream(upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS)  # just parentheses
pollSCM('H * * * *')     # single quotes and parentheses

Desired result respectively:

upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS
H * * * *

My current result:

upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS
H * * * *'        # Notice the trailing single quote

So far this works for the first trigger (upstream), but not for the second one (pollSCM), which keeps a trailing single quote.

After the group (.+), the \'* does not consume the trailing single quote, although the \) does match the closing parenthesis. I could simply use .replace() or .strip() to remove the quote, but what is wrong with my regex pattern, and how can I improve it? Here's my code:

import re

pattern = r"[A-Za-z]*\(\'*\"*(.+)\'*\"*\)"
text1 = r"upstream(upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS)"
text2 = r"pollSCM('H * * * *')"
trigger_value1 = re.search(pattern, text1).group(1)
trigger_value2 = re.search(pattern, text2).group(1)

Upvotes: 1

Views: 877

Answers (3)

Paul Carlton

Reputation: 2993

The \'* part of your pattern means zero or more single quotes, so it is allowed to match nothing. Because .+ is greedy, it grabs the last ' itself and leaves nothing for \'*. You need to add a ? after .+, as in (.+?), to make it non-greedy. It then grabs only as much as it has to, which means it stops before the trailing '.

This pattern will work for you: [A-Za-z]*\(\'*\"*(.+?)\'*\"*\)
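
As a quick check, here is a minimal sketch that runs the original greedy pattern and the lazy one against the two strings from the question. The variable names are only illustrative.

```python
import re

# Pattern from the question (greedy group) and the corrected one (lazy group)
greedy = r"[A-Za-z]*\(\'*\"*(.+)\'*\"*\)"
lazy = r"[A-Za-z]*\(\'*\"*(.+?)\'*\"*\)"

text1 = "upstream(upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS)"
text2 = "pollSCM('H * * * *')"

# Greedy: .+ swallows the trailing quote, so \'* is left matching nothing
greedy_value2 = re.search(greedy, text2).group(1)

# Lazy: .+? stops as soon as the rest of the pattern can match
lazy_value1 = re.search(lazy, text1).group(1)
lazy_value2 = re.search(lazy, text2).group(1)

print(greedy_value2)  # H * * * *'
print(lazy_value1)    # upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS
print(lazy_value2)    # H * * * *
```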

[UPDATE]

To answer your question below I'll just add it here.

"So the ? will make it not greedy up until the next character indicated in the pattern?"

Yes. Repetition operators are greedy by default, and the ? turns them into lazy quantifiers. So .*?a matches as little as possible, up to and including the first a, while .*a matches as much as possible, up to and including the last a it can reach. If your string is aaaaaaaa and your regex is .*?a, each a is therefore matched separately: substituting b for every match gives bbbbbbbb. With .*a the whole string is a single match, so the same substitution returns a single b.
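
The substitution example can be reproduced with re.sub. This is just a small sketch of the behaviour described above:

```python
import re

s = "aaaaaaaa"

# Lazy: each match is a single "a", so every "a" is replaced on its own
lazy_result = re.sub(r".*?a", "b", s)

# Greedy: the first match consumes the whole string, so there is one replacement
greedy_result = re.sub(r".*a", "b", s)

print(lazy_result)    # bbbbbbbb
print(greedy_result)  # b
```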

Here's a link that explains the different quantifier types (greedy, lazy, possessive): http://www.rexegg.com/regex-quantifiers.html

Upvotes: 1

Rakesh

Reputation: 82765

import re
s = """upstream(upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS)  # just parentheses
pollSCM('H * * * *')"""
print(re.findall(r"\((.*?)\)", s))  # raw string, so \( and \) are not treated as string escapes

Output:

["upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS", "'H * * * *'"]
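
Note that the second value still carries its surrounding quotes. If that matters, one option (the .strip() the question mentions, not part of the pattern itself) is to strip quotes from both ends of each match. This sketch assumes a value never legitimately begins or ends with a quote character; something like 'a', 'b' would lose its outer quotes.

```python
import re

s = """upstream(upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS)  # just parentheses
pollSCM('H * * * *')"""

# Capture everything between the parentheses, then drop quotes at either end.
# Assumption: no value starts or ends with a quote that should be kept.
values = [v.strip("'\"") for v in re.findall(r"\((.*?)\)", s)]

for v in values:
    print(v)
```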

Upvotes: 2

The fourth bird

Reputation: 163277

For your example data you could make the ' optional with '?, capture your values in a group, and then loop through the captured groups.

\('?(.*?)'?\)

import re

regex = r"\('?(.*?)'?\)"

test_str = ("upstream(upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS)  # just parentheses\n"
    "pollSCM('H * * * *')     # single quotes and parentheses")

# Print every capture group of every match (group numbers start at 1)
for match in re.finditer(regex, test_str, re.MULTILINE):
    for group_num in range(1, len(match.groups()) + 1):
        print(match.group(group_num))


That would give you:

upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS
H * * * *

To get a stricter match you could use an alternation that matches between () or ('') but not with a single unbalanced ' as in ('H * * * *), and then loop through the captured groups. Because the pattern now has 2 capture groups and only 1 of the 2 takes part in any given match, you should check that you only retrieve the non-empty group.

\((?:'(.*?)'|([^'].*?[^']))\)
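
A minimal sketch of that check follows. The helper name extract is made up for illustration. With re.search, the group that did not take part in the match is None rather than an empty string, so the sketch tests for None.

```python
import re

strict = r"\((?:'(.*?)'|([^'].*?[^']))\)"

lines = [
    "upstream(upstreamProjects: 'upstreamJob', threshold: hudson.model.Result.SUCCESS)",
    "pollSCM('H * * * *')",
    "pollSCM('H * * * *)",  # unbalanced quote: should not match
]

def extract(line):
    """Return the trigger argument, or None if the line does not match."""
    m = re.search(strict, line)
    if m is None:
        return None
    # Only one alternative can match, so only one group is set
    return m.group(1) if m.group(1) is not None else m.group(2)

results = [extract(line) for line in lines]
for r in results:
    print(r)
```

The third line prints None because neither alternative accepts an opening quote without a closing one.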


Upvotes: 0
