Tim

Reputation: 1933

Regex to keep empty parts in the same group index

Context

I have a string in which some parts can either be empty or contain information. I'm looking for a regex in which each piece of information always lands in the same group index, regardless of whether it is empty or not.

The expected benefit is that group indexes remain the same even if a group is empty, so I can say that this or that information is in this or that group.

Examples

Possible inputs and expected group outputs:

1. Input: ABC_0000_0.0.0.xyz
Group 1  : ABC
Group 2  : 0000
Group 3  : 0.0.0
Group 4  : <empty>
Group 5  : <empty>

2. Input: ABC_0001_0.1.0_N.xyz
Group 1  : ABC
Group 2  : 0001
Group 3  : 0.1.0
Group 4  : _N
Group 5  : <empty>

3. Input: ABC_0002_1.1.2_foo.xyz
Group 1  : ABC
Group 2  : 0002
Group 3  : 1.1.2
Group 4  : <empty>
Group 5  : _foo

4. Input: ABC_0002_42.42.42_N_bar.xyz
Group 1  : ABC
Group 2  : 0002
Group 3  : 42.42.42
Group 4  : _N
Group 5  : _bar

What I tried

I tried the following regex:

^(ABC)_([0-9]{4})_([0-9]+\.[0-9]+\.[0-9]+)(_[a-zA-Z]+)?(_[a-zA-Z]+)?\.xyz$

The problem with that one is that I only get all 5 groups for example 4. For example 1 there are only 3 groups, and for examples 2 and 3 there are 4 groups, but the fourth one may contain two different types of information.

I then tried to adapt the part that is supposed to capture groups 4 and 5 by using an alternation that also matches the empty string:

^(ABC)_([0-9]{4})_([0-9]+\.[0-9]+\.[0-9]+)(|_[a-zA-Z]+)(|_[a-zA-Z]+)\.xyz$

This is promising: it works for every example except example 2, in which _N is put in group 5, while I want it to be in group 4.
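
The behaviour described above can be reproduced with a short sketch. It assumes Python's `re` module, since the question does not name a regex engine. In each group the empty alternative is tried first, so group 4 matches the empty string and `_N` is left for group 5.

```python
import re

# Second attempt from the question: each optional part is "empty OR _letters".
ATTEMPT = re.compile(
    r"^(ABC)_([0-9]{4})_([0-9]+\.[0-9]+\.[0-9]+)(|_[a-zA-Z]+)(|_[a-zA-Z]+)\.xyz$"
)

# Example 2: the empty branch of group 4 wins, so _N is captured by group 5.
m = ATTEMPT.match("ABC_0001_0.1.0_N.xyz")
attempt_groups = m.groups()
print(attempt_groups)  # ('ABC', '0001', '0.1.0', '', '_N')
```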

Question

Which regex would produce the expected groups for all of the examples given?

Upvotes: 2

Views: 91

Answers (1)

Wiktor Stribiżew

Reputation: 626738

You can use

^(ABC)_([0-9]{4})_([0-9]+\.[0-9]+\.[0-9]+)(?:(_[a-zA-Z]+)(|_[a-zA-Z]+))?\.xyz$

See the regex demo

Now, due to (?:(_[a-zA-Z]+)(|_[a-zA-Z]+))?, Group 5 will only match after Group 4 is matched:

  • (?: - start of a non-capturing group:
    • (_[a-zA-Z]+) - Group 4: _ and one or more letters
    • (|_[a-zA-Z]+) - Group 5: empty string or _ and one or more letters
  • )? - end of the non-capturing group, one or zero repetitions (due to ?).
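
As a rough check, the pattern above can be run against the four inputs from the question. This is a minimal sketch assuming Python's `re` module; the helper name `nested_groups` is hypothetical, and `groups(default="")` is used so that groups that did not participate in the match are reported as empty strings rather than `None`. Note that with this pattern a single trailing part always goes to Group 4, so `_foo` in example 3 lands in Group 4 rather than Group 5.

```python
import re

# Groups 4 and 5 live inside one optional non-capturing group,
# so Group 5 can only be tried once Group 4 has matched.
NESTED = re.compile(
    r"^(ABC)_([0-9]{4})_([0-9]+\.[0-9]+\.[0-9]+)(?:(_[a-zA-Z]+)(|_[a-zA-Z]+))?\.xyz$"
)

def nested_groups(name):
    """Return the five groups as strings, or None if the name does not match."""
    m = NESTED.match(name)
    return m.groups(default="") if m else None

for name in (
    "ABC_0000_0.0.0.xyz",
    "ABC_0001_0.1.0_N.xyz",
    "ABC_0002_1.1.2_foo.xyz",
    "ABC_0002_42.42.42_N_bar.xyz",
):
    print(name, nested_groups(name))
# ABC_0000_0.0.0.xyz ('ABC', '0000', '0.0.0', '', '')
# ABC_0001_0.1.0_N.xyz ('ABC', '0001', '0.1.0', '_N', '')
# ABC_0002_1.1.2_foo.xyz ('ABC', '0002', '1.1.2', '_foo', '')
# ABC_0002_42.42.42_N_bar.xyz ('ABC', '0002', '42.42.42', '_N', '_bar')
```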

If Group 4 can only contain a single letter, use

^(ABC)_([0-9]{4})_([0-9]+\.[0-9]+\.[0-9]+)(_[a-zA-Z])?(_[a-zA-Z]+)?\.xyz$

See the regex demo.

Now, due to (_[a-zA-Z])?, Group 4 will only match _ followed by a single letter, so a longer part such as _foo is left for Group 5.
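
The single-letter variant can be checked the same way. This is a sketch under the same assumptions (Python's `re` module, a hypothetical helper named `flat_groups`, and `groups(default="")` to turn unmatched groups into empty strings). It reproduces the four expected outputs from the question, provided the Group 4 part is always exactly one letter.

```python
import re

# Group 4 accepts only "_" plus ONE letter; anything longer backtracks
# out of Group 4 and is captured by Group 5 instead.
FLAT = re.compile(
    r"^(ABC)_([0-9]{4})_([0-9]+\.[0-9]+\.[0-9]+)(_[a-zA-Z])?(_[a-zA-Z]+)?\.xyz$"
)

def flat_groups(name):
    """Return the five groups as strings, or None if the name does not match."""
    m = FLAT.match(name)
    return m.groups(default="") if m else None

for name in (
    "ABC_0000_0.0.0.xyz",
    "ABC_0001_0.1.0_N.xyz",
    "ABC_0002_1.1.2_foo.xyz",
    "ABC_0002_42.42.42_N_bar.xyz",
):
    print(name, flat_groups(name))
# ABC_0000_0.0.0.xyz ('ABC', '0000', '0.0.0', '', '')
# ABC_0001_0.1.0_N.xyz ('ABC', '0001', '0.1.0', '_N', '')
# ABC_0002_1.1.2_foo.xyz ('ABC', '0002', '1.1.2', '', '_foo')
# ABC_0002_42.42.42_N_bar.xyz ('ABC', '0002', '42.42.42', '_N', '_bar')
```

One limitation of this design: a one-letter second part such as `_x` on its own would be captured by Group 4, because the pattern distinguishes the two parts only by length.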

Upvotes: 1
