Reputation: 45
I have the following string,
"ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"
In the above string, using a regex, I want to find all occurrences of 'TAG' except the first 3.
I used the regex '(TAG.*?){4}', but it only finds the 4th occurrence ('TAG:') and not the others ('TAG.', 'TAG ', 'TAGH').
Upvotes: 0
Views: 253
Reputation: 19470
If you want a capture group with all the remaining matches, you have to consume the first ones first:
(TAG.*?){3}(TAG.*?)*
This matches the first 3 occurrences in the first capture group and the rest in the 2nd.
If you don't want the first matches to be in a capture group, you can mark that group as non-capturing:
(?:TAG.*?){3}(TAG.*?)*
Judging by your example, I think the regex inside the capture group is not correct yet. If this doesn't already give you the right idea of how to do this, please give us an example of the matches you want to see. I'll edit my answer then.
EDIT:
I get the feeling that you want to capture the 4th and following occurrences in their own capture groups while still ignoring the first 3 occurrences.
I can't properly explain why, but I think that's not possible because of the following reasons:
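One reason that fits, offered as an assumption rather than as the complete list: in Python's re module, a capture group under a quantifier keeps only its last repetition, so a single match cannot hand back a variable number of separate groups. A minimal sketch, reusing the single-TAG pattern from further down (the names match and last_only are only for illustration):

```python
import re

text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"

# Skip three occurrences without capturing, then repeat one capture group
# over the remaining ones. Each occurrence is "TAG" plus every character
# up to (not including) the next "TAG".
pattern = r"(?:TAG(?:(?!TAG).)+){3}(TAG(?:(?!TAG).)+)*"
match = re.search(pattern, text)

# The group was repeated four times, but only the last repetition is kept.
last_only = match.group(1)
print(last_only)  # TAGHHHH
```

Other engines may behave differently here, so treat this as a statement about Python only.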
So, how to solve this?
I'd come up with a proper regex for one TAG and repeat that using the find all or g modifier. In Python you can then simply take all findings, skipping the first 3:
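import re
text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"
# TAG followed by one or more characters that do not start another TAG
pattern = r"(?:TAG((?:(?!TAG).)+))"
# findall returns the capture group of every match; [3:] skips the first 3
findings = re.findall(pattern, text)[3:]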
If you want to ignore the first character after TAG, just add a . behind TAG:
pattern = r"(?:TAG.((?:(?!TAG).)+))"
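To make the difference between the two patterns concrete, here is a small self-contained sketch that runs both against the string from the question and skips the first 3 findings. The variable names with_first_char and without_first_char are made up for this illustration.

```python
import re

text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"

# Capture everything after TAG up to the next TAG.
keep_pattern = r"(?:TAG((?:(?!TAG).)+))"
# Same, but the extra "." swallows the first character after TAG.
drop_pattern = r"(?:TAG.((?:(?!TAG).)+))"

# Slicing with [3:] drops the first three findings and keeps the rest.
with_first_char = re.findall(keep_pattern, text)[3:]
without_first_char = re.findall(drop_pattern, text)[3:]

print(with_first_char)     # [':DDDDEEEEC', '.FFFFC', ' GGGGC', 'HHHH']
print(without_first_char)  # ['DDDDEEEEC', 'FFFFC', 'GGGGC', 'HHH']
```

Note that the last occurrence has no separator after TAG, so the second pattern drops one of the H characters there.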
Explanation of the regex:
- I use ?: to make some capturing groups non-capturing groups. I only want to deal with one capture group.
- To get rid of the non-greedy modifier and be a little more specific about what we actually want, I've introduced the negative lookahead after the TAG occurrence.
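If the goal is literally the matches listed in the question ('TAG:', 'TAG.', 'TAG ', 'TAGH'), the same skip-the-first-3 idea also works with match objects. This is only a sketch: the simpler pattern TAG. and the use of itertools.islice are my own choices here, not something the question requires.

```python
import re
from itertools import islice

text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"

# TAG plus the single character that follows it, as in the question.
matches = re.finditer(r"TAG.", text)

# islice(matches, 3, None) skips the first three match objects lazily.
remaining = [(m.start(), m.group()) for m in islice(matches, 3, None)]

for position, found in remaining:
    print(position, repr(found))
```

Each entry pairs the start index of the match with the matched text, which helps if you need to know where the 4th and later occurrences are.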
Upvotes: 1