Reputation: 1096

Python RegEx matching only the inside regex

I found the solution to this problem on StackOverflow some time ago but couldn't find it again. I want to extract a pattern from a string.

my_string ='hello ,mister synonyms: fine, of high quality, of a high standard, quality, superior; More'

I want to extract 'fine, of high quality, of a high standard, quality, superior'

I used

match_obj = re.search(r'(synonyms: )((\w+,|; )+)', my_string)
print(match_obj.group(2))

It gives only 'fine,'. I know there is something wrong in the way I am writing the regex for the nested brackets in this case, but I am unable to find the right way to write it.

Upvotes: 0

Answers (4)

stevieb

Reputation: 9296

This will capture everything between "synonyms: " and ";" into a single string. Because the positive lookbehind (?<=synonyms: ) is a zero-width, non-capturing assertion, it is not part of the match, so the whole match (group 0) is exactly the text matched by ([^;]+).

import re
test_str = "hello ,mister synonyms: fine, of high quality, of a high standard, quality, superior; More"
regex = re.compile(r'(?<=synonyms: )([^;]+)')
string = regex.search(test_str).group(0)

print(string)
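
As a quick check of the point about groups, here is a minimal sketch (the names whole and inner are only illustrative) showing that group(0), the whole match, and group(1), the bracketed part, hold the same text, because the lookbehind consumes nothing:

```python
import re

test_str = "hello ,mister synonyms: fine, of high quality, of a high standard, quality, superior; More"

# The lookbehind only asserts that "synonyms: " precedes the match;
# it is not part of the matched text.
match = re.search(r'(?<=synonyms: )([^;]+)', test_str)

whole = match.group(0)  # the entire match
inner = match.group(1)  # the text captured by ([^;]+)

print(whole == inner)
print(whole)
```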

Upvotes: 1

b3000

Reputation: 1677

If I understand correctly you want to match everything after synonyms: up to the semicolon?

r'(synonyms: )([\w, ]+)'

See it in action: https://regex101.com/r/jI0dV4/1

I think the flaw in your regex was essentially the placement of the |. It makes the regex match either \w+, or ;_ (_ denotes a space). Neither alternative allows the space that follows each comma, so the match stops after fine,

Note that grouping with plain round brackets introduces a new capturing group (unless you write (?:...)). I used square brackets to list the allowed characters.

If you follow the link you can try out different things and get instant results and explanations.
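
To see the difference locally rather than on regex101, here is a small sketch that runs the pattern from the question next to the character-class version on the same string (the names original and fixed are just for illustration):

```python
import re

my_string = 'hello ,mister synonyms: fine, of high quality, of a high standard, quality, superior; More'

# Pattern from the question: each repetition must be "word," or "; ",
# so the space after "fine," ends the repetition.
original = re.search(r'(synonyms: )((\w+,|; )+)', my_string)
print(original.group(2))

# Character class: any run of word characters, commas and spaces.
fixed = re.search(r'(synonyms: )([\w, ]+)', my_string)
print(fixed.group(2))
```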

Upvotes: 1

Jota

Reputation: 17611

If you simply want to match whatever is between "synonyms: " and ";", then you could use one of the following:

(synonyms: )([\w, ]+|[^;])+
(synonyms: )(\w+, [^;]+)+
(synonyms: )(.+)(?=;)
(synonyms: )([^;]+)
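
A rough way to compare the four patterns is to run them all on the string from the question. The second input below, with an extra semicolon, is a made-up example (not from the question) to show where the greedy .+ variant behaves differently from [^;]+:

```python
import re

patterns = [
    r'(synonyms: )([\w, ]+|[^;])+',
    r'(synonyms: )(\w+, [^;]+)+',
    r'(synonyms: )(.+)(?=;)',
    r'(synonyms: )([^;]+)',
]

test_str = "hello ,mister synonyms: fine, of high quality, of a high standard, quality, superior; More"

# On the question's input, group(2) is the same for every pattern.
results = [re.search(p, test_str).group(2) for p in patterns]
print(results[0])
print(len(set(results)) == 1)

# Hypothetical input with a second semicolon: the greedy .+ runs up to
# the last ";", while [^;]+ stops at the first one.
two_semicolons = "synonyms: fine, superior; More; extra"
greedy = re.search(patterns[2], two_semicolons).group(2)
negated = re.search(patterns[3], two_semicolons).group(2)
print(greedy)
print(negated)
```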

Upvotes: 1

Wiktor Stribiżew

Reputation: 626748

You can first obtain the substring with the comma-separated values using the (?<=synonyms: )[^;]+ regex, which matches 1 or more characters other than ; right after the synonyms: substring. Then split it with the \s*,\s* regex (this also trims the values, thanks to the whitespace matched by \s*) to get the individual values:

import re
p = re.compile(r'(?<=synonyms: )[^;]+')
test_str = "hello ,mister synonyms: fine, of high quality, of a high standard, quality, superior; More"
o = re.search(p, test_str)
if o:
    s = o.group()
    print(re.split(r"\s*,\s*", s))

See IDEONE demo

UPDATE

Since your intention is to learn capturing and non-capturing groups, here is your fixed regex:

(synonyms: )((?:\s*\w+,?)+)

And the explanation:

(synonyms: ) - The first capturing group, matching the literal text synonyms: followed by a space
((?:\s*\w+,?)+) - The second capturing group that matches
- (?:\s*\w+,?)+ - 1 or more repetitions of a non-capturing group (i.e. its text is not stored as a separate group) consisting of
  - \s* - 0 or more whitespace characters
  - \w+ - 1 or more word characters ([A-Za-z0-9_]; in Python 3, str patterns also match Unicode letters and digits by default)
  - ,? - 0 or 1 comma
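
The breakdown above can be checked with a short sketch. It runs the fixed regex on the string from the question and confirms that only two groups exist, since the (?:...) part does not create one (the variable names are illustrative):

```python
import re

my_string = 'hello ,mister synonyms: fine, of high quality, of a high standard, quality, superior; More'

pattern = re.compile(r'(synonyms: )((?:\s*\w+,?)+)')
match = pattern.search(my_string)

# Group 2 spans every repetition of the non-capturing group.
print(match.group(2))

# Only the two bracketed groups are counted; (?:...) adds none.
print(pattern.groups)
```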

Demo is available here.

Note 4 things:

- You do not have to capture literal texts. You know them already, so there is just no point in that.
- Python's re engine does not keep every capture of a repeated group as .NET does (where we have the .Captures property); only the last one is kept. Thus, we cannot use a capturing group to get all individual comma-separated values that easily. Nor does Python's re support \G in regex to get consecutive matches.
- To obtain the individual entries in Python, we'll have to split the obtained string as a second step (of course, only if you need to).
- Thinking about optimization: the (?:\s*\w+,?)+ part in the regex looks tricky, but the point is that its 3 components, \s, \w and the comma, cannot match the same text. It is important to follow the same tactic when you write really complex regexes with a + quantifier applied to a whole group.
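
To illustrate the second and third notes, here is a sketch in which the inner group is deliberately made capturing (a variation on the regex above, used only for demonstration). It shows that Python keeps just the last repetition, and that splitting the second group gives the individual entries:

```python
import re

my_string = 'hello ,mister synonyms: fine, of high quality, of a high standard, quality, superior; More'

# Inner group made capturing on purpose: after the match, group(3)
# holds only the text of its last repetition.
m = re.search(r'(synonyms: )((\s*\w+,?)+)', my_string)
last_only = m.group(3)
print(repr(last_only))

# Two-step approach: take the whole list from group(2), then split it.
entries = re.split(r'\s*,\s*', m.group(2))
print(entries)
```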

Upvotes: 2
