How to find all occurrences of a substring except the first three, using REGEX?

Question

I have the following string,

"ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"

In the above string, using REGEX, I want to find all occurrences of 'TAG' except the first 3.

I used this REGEX, '(TAG.*?){4}', but it only finds the 4th occurrence ('TAG:') and not the others ('TAG.', 'TAG ', 'TAGH').

wullxz · Accepted Answer

If you want a capture group with all the remaining matches, you have to consume the first ones first:

(TAG.*?){3}(TAG.*?)*

This matches the first 3 occurrences in the first capture group and matches the rest in the 2nd.
If you don't want the first matches to be in a capture group, you can mark it as a non-capturing group:

(?:TAG.*?){3}(TAG.*?)*

Depending on your example, I think the regex inside the capture group is not correct yet. If this doesn't already give you the right idea of how to do this, please give us an example of the matches you want to see. I'll edit my answer then.
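
A quick check illustrates that caveat. This is only a sketch and assumes Python's re module, since the question does not name a regex flavor; the variable names are illustrative. Because .*? is lazy and the second group is optional, the pattern can succeed right after the third TAG without the second group matching anything:

```python
import re

text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"

# Lazy .*? expands only as far as needed for the overall match to succeed,
# and (TAG.*?)* is allowed to match zero times.
m = re.search(r"(?:TAG.*?){3}(TAG.*?)*", text)

whole_match = m.group(0)  # text consumed by the whole pattern
rest = m.group(1)         # None: the second group never took part in the match

print(repr(whole_match))  # 'TAG:AAAABTAG:BBBBCTAG'
print(repr(rest))         # None
```

So the part inside the capture group needs a more specific description of what may follow each TAG before it captures anything useful.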

EDIT:

I get the feeling that you want to capture the 4th and following occurrences in their own capture groups while still ignoring the first 3 occurrences.
I think that's not possible with a single regex, for the following reasons:

- Ignoring the first 3 occurrences in their own (non-)capturing group forces you to abandon the 'g' modifier for finding all occurrences (because that would just do 'ignore 3 TAGs, find 1' in a loop).
- It is not possible to capture multiple values with just one capture group. Trying to do that always captures the last occurrence. It is possible to capture all occurrences together in a single capture group instead of just the last one, but it seems like you want them in separate groups.
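
Both reasons can be reproduced with a short sketch. It assumes Python's re module, where re.findall plays the role of the 'g' modifier; the helper name segment and the exact patterns are illustrative, not part of the question:

```python
import re

text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"

# One TAG plus everything up to (but not including) the next TAG.
segment = r"TAG(?:(?!TAG).)*"

# Reason 1: "skip three TAGs, capture the fourth", repeated by findall.
# The first match uses up TAGs 1-4; only three TAGs remain, so nothing else matches.
skip_then_one = re.findall("(?:" + segment + "){3}TAG((?:(?!TAG).)*)", text)
print(skip_then_one)  # [':DDDDEEEEC']

# Reason 2: a repeated capture group keeps only its final iteration.
m = re.search(r"(?:TAG((?:(?!TAG).)*))+", text)
last_only = m.group(1)
print(repr(last_only))  # 'HHHH'
```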

So, how to solve this?
I'd come up with a proper regex for a single TAG and apply it repeatedly using a find-all function or the g modifier. In Python you can then simply take all findings and skip the first 3:

import re

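text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"
# TAG, then capture everything up to (but not including) the next TAG
pattern = r"(?:TAG((?:(?!TAG).)+))"

# findall returns one capture per TAG; the slice drops the first three
findings = re.findall(pattern, text)[3:]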

If you want to ignore the first character after TAG, just add a . after TAG:

pattern = r"(?:TAG.((?:(?!TAG).)+))"

Explanation of the regex:
- I use ?: to turn some groups into non-capturing groups, so that there is only one capture group to deal with.
- To get rid of the non-greedy modifier and be a little more specific about what we actually want, I've introduced the negative lookahead (?!TAG) after the TAG occurrence: each character is consumed only if it does not start another TAG, so the capture stops right before the next one.
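
For reference, here is a sketch of what the approach yields on the sample string, assuming Python's re module. The second half is an alternative that is not part of the answer: it uses re.finditer with itertools.islice to return 'TAG' plus the one following character, which is the form listed in the question, together with the start offsets. All variable names are illustrative.

```python
import re
from itertools import islice

text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"

# Text following each TAG, up to the next TAG, skipping the first three TAGs.
tails = re.findall(r"TAG((?:(?!TAG).)+)", text)[3:]
print(tails)  # [':DDDDEEEEC', '.FFFFC', ' GGGGC', 'HHHH']

# Alternative: lazily skip three matches and keep the match objects,
# which also gives access to the position of each remaining TAG.
later = list(islice(re.finditer(r"TAG.", text), 3, None))
tag_plus_char = [m.group(0) for m in later]
starts = [m.start() for m in later]
print(tag_plus_char)  # ['TAG:', 'TAG.', 'TAG ', 'TAGH']
print(starts)         # [28, 41, 50, 59]
```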
