Zek

Reputation: 254

How to write this regex without catastrophic backtracking

I am trying to write a regex that captures the contents of the 21st field in this list for lines that start with an I, provided that the field contains a number in the format nnn-nnnnnn (like 001-123456):

T|112||     |               | |AZ        |D         |1   |       1|
I|   10|ACAA          |BY CORD EACH             |      10.00-|       .99 |     |      .36 |1   |       1|D         |I|CO |BTE  |N| |       .00 |      .00 |15 |1    |001-123456     |ACAA 
I|   20|LEES03        |TINTED OZ                |       2.00-|      6.50 |     |     4.48 |1   |       1|D         |I|FL |LTGE |N| |       .00 |      .00 |45 |1    |001-234555     |JEE  
I|   20|LEES03        |TINTED OZ                |       2.00-|      6.50 |     |     4.48 |1   |       1|D         |I|FL |LTGE |N| |       .00 |      .00 |45 |1    |               |JEE  
I|   20|LEES03        |TINTED OZ                |       2.00-|      6.50 |     |     4.48 |1   |       1|D         |I|FL |LTGE |N| |       .00 |      .00 |45 |1    |001-234552     |JEE  

Here is the simple regex that I am using; the field content is captured in the 2nd capture group:

^I(\|.*?){20}(\d{3}-\d{6})

I have read about catastrophic backtracking, but my regex skills are limited and I do not understand how to rewrite this regex so that it does not backtrack catastrophically.

Help would be appreciated.

Upvotes: 3

Views: 247

Answers (2)

Casimir et Hippolyte

Reputation: 89547

IMO, a better way is to split the string on pipes and then check the first and the 21st fields. An example on the command line, using the autosplit switch -a:

perl -F'\|' -anE'say $& if $F[0] eq "I" && $F[20]=~/\S+/' file

Example in a script:

use strict;
use warnings;
use feature qw(say);

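while (<DATA>) {
    # Split on literal pipes: $F[0] is the 1st field, $F[20] the 21st.
    my @F = split /\|/;
    # Print the number only for "I" lines whose 21st field holds nnn-nnnnnn.
    say $1 if $F[0] eq 'I' && $F[20] =~ /(\d{3}-\d{6})/;
}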

__DATA__
T|112||     |               | |AZ        |D         |1   |       1|
I|   10|ACAA          |BY CORD EACH             |      10.00-|       .99 |     |      .36 |1   |       1|D         |I|CO |BTE  |N| |       .00 |      .00 |15 |1    |001-123456     |ACAA 
I|   20|LEES03        |TINTED OZ                |       2.00-|      6.50 |     |     4.48 |1   |       1|D         |I|FL |LTGE |N| |       .00 |      .00 |45 |1    |001-234555     |JEE  
I|   20|LEES03        |TINTED OZ                |       2.00-|      6.50 |     |     4.48 |1   |       1|D         |I|FL |LTGE |N| |       .00 |      .00 |45 |1    |               |JEE  
I|   20|LEES03        |TINTED OZ                |       2.00-|      6.50 |     |     4.48 |1   |       1|D         |I|FL |LTGE |N| |       .00 |      .00 |45 |1    |001-234552     |JEE  

Upvotes: 5

anubhava

Reputation: 784938

You can avoid catastrophic backtracking by using a negated character class:

^I(?:\|[^|]*){20}(\d{3}-\d{6})

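[^|]* matches 0 or more characters that are not |, so each repetition stays inside a single field. In your pattern, . can also match |, which lets every .*? stretch across several fields and gives the engine a huge number of ways to retry when a line does not match.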
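For anyone who wants to try the pattern outside a regex tester, here is a minimal sketch using Python's re module. Python is just my choice for the illustration, and the names FIELD_21, SAMPLE and extract_field_21 are made up for this example; the sample lines are the ones from the question.

```python
import re

# Pattern from the answer: "I" at the start of the line, then 20 repetitions
# of "a pipe followed by a run of non-pipe characters", then the value.
FIELD_21 = re.compile(r"^I(?:\|[^|]*){20}(\d{3}-\d{6})")

# Sample lines copied from the question (the third "I" line has an empty field 21).
SAMPLE = """\
T|112||     |               | |AZ        |D         |1   |       1|
I|   10|ACAA          |BY CORD EACH             |      10.00-|       .99 |     |      .36 |1   |       1|D         |I|CO |BTE  |N| |       .00 |      .00 |15 |1    |001-123456     |ACAA 
I|   20|LEES03        |TINTED OZ                |       2.00-|      6.50 |     |     4.48 |1   |       1|D         |I|FL |LTGE |N| |       .00 |      .00 |45 |1    |001-234555     |JEE  
I|   20|LEES03        |TINTED OZ                |       2.00-|      6.50 |     |     4.48 |1   |       1|D         |I|FL |LTGE |N| |       .00 |      .00 |45 |1    |               |JEE  
I|   20|LEES03        |TINTED OZ                |       2.00-|      6.50 |     |     4.48 |1   |       1|D         |I|FL |LTGE |N| |       .00 |      .00 |45 |1    |001-234552     |JEE  
"""


def extract_field_21(text):
    """Return the nnn-nnnnnn values captured from lines that start with I."""
    values = []
    for line in text.splitlines():
        match = FIELD_21.match(line)  # match() anchors at the start of the line
        if match:
            values.append(match.group(1))
    return values


if __name__ == "__main__":
    print(extract_field_21(SAMPLE))  # ['001-123456', '001-234555', '001-234552']
```

The T line and the I line with the blank field are skipped, and a failing line is rejected quickly because no repetition can cross a pipe.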


Upvotes: 5
