seandavi

Reputation: 2968

regex for capturing group that is only sometimes present

I have a set of filenames like:

PATJVI_RNA_Tumor_8_3_63BJTAAXX.310_BUSTARD-2012-02-19.fq.gz
PATMIF_RNA_Tumor_CGTGAT_2_1_BC0NKBACXX.334_BUSTARD-2012-05-07.fq.gz

I would like to have a single regex (in Python) that captures each of the groups between the "_" characters. Note, however, that the second filename contains a group that is not present in the first. Of course, one could use a string split, etc., but I would like to do this with a single regex. The regex for the first filename is something like:

(\w+)_(\w+)_(\w+)_(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz

And the second will be:

(\w+)_(\w+)_(\w+)_(\w+)_(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz

I'd like the regex group to be empty when the optional group is absent and to contain the optional group when it is present (so that I can use it later in constructing a new filename with \4).

Upvotes: 2

Views: 152

Answers (1)

Yossi

Reputation: 12100

To make a group optional, you can add ? after the desired group. Like this: (\w+)?

But your example has an underscore that should be optional as well. To deal with it, you can group it together with the optional group:

((\w+)_)?

However, this will add a new group to your match results. To avoid that, use a non-capturing group:

(?:(\w+)_)?
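To see the difference in group numbering, here is a small sketch using Python's `re` module. The trailing `(\d)` and the sample string `CGTGAT_2` are only there for illustration; they are cut down from the filenames in the question.

```python
import re

# Capturing wrapper: the helper group around "field + underscore" gets a number too.
capturing = re.compile(r"((\w+)_)?(\d)")
# Non-capturing wrapper: only the inner field and the digit are numbered.
non_capturing = re.compile(r"(?:(\w+)_)?(\d)")

# Pattern.groups is the number of capturing groups in the pattern.
print(capturing.groups, non_capturing.groups)

sample = "CGTGAT_2"
print(capturing.fullmatch(sample).groups())      # includes the extra "CGTGAT_" group
print(non_capturing.fullmatch(sample).groups())  # only the field and the digit

# When the optional part is missing, its group is None rather than an empty string.
print(non_capturing.fullmatch("2").groups())
```

With the non-capturing form, the groups that follow keep the numbers they would have had without the wrapper.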

The final result will look like this:

(\w+)_(\w+)_(\w+)_(?:(\w+)_)?(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz
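One caveat when trying this on the two filenames from the question: `\w` also matches `_`, so with `(\w+)` for every field a greedy earlier group can absorb an underscore (for example `PATMIF_RNA`), and the optional group can then stay empty even for the second filename. The sketch below is a variation, not the pattern above verbatim: it assumes the underscore-delimited fields never contain `_` themselves and uses `[^_]+` for them. The names `PATTERN`, `FILENAMES`, and `parse` are made up for the example.

```python
import re

# Assumption: fields between underscores contain no "_" themselves, so [^_]+
# is used for them instead of \w+ (which would also match "_").
PATTERN = re.compile(
    r"([^_]+)_([^_]+)_([^_]+)_(?:([^_]+)_)?(\d)_(\d)_(\w+)\.(\d+)_(\S+)\.fq\.gz"
)

FILENAMES = [
    "PATJVI_RNA_Tumor_8_3_63BJTAAXX.310_BUSTARD-2012-02-19.fq.gz",
    "PATMIF_RNA_Tumor_CGTGAT_2_1_BC0NKBACXX.334_BUSTARD-2012-05-07.fq.gz",
]


def parse(filename):
    """Return the nine captured fields; the optional fourth one is "" if absent."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    # groups(default="") turns the unmatched optional group into an empty string.
    return match.groups(default="")


for name in FILENAMES:
    fields = parse(name)
    print(repr(fields[3]), fields)
```

In a substitution, Python 3.5 and later replace a reference to an unmatched group such as `\4` with an empty string; `match.groups(default="")` makes the same choice explicit when the fields are handled in code.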

Upvotes: 7
