Regex hangs for a pattern match

Question

I've the input file here (please extract it) and the code below. Just after print "pat= "; it hangs in the pattern match while I see in Windows Task manager Perl(v5.010000) taking 25% CPU time:

use strict;
open FP, ";

while ( $content =~ /(['"])([^\s\x00-\x1f]{300,})\1/gs ) {
#looks like we've found base encoded string etc
    my $subpat = $2;
    print "pat=<$subpat>
";
    if ( $subpat =~ m#(?:(....).{5,30}(?=\1)){50}#s ) {
      print "hello world";
    } else {
      die("");
    }
    exit;
}

EDIT: Ideally what I want to figure out any uniform repeating pattern of bytes (length:4) in contiguous groups (max 50) in which each group size could be 4+30 bytes max and 4+5 bytes min.

For example (repeating '=>' in each group of size 1 to 30 bytes):

'cs'=>'Czech','da'=>'Danish','nl'=>'Dutch','fi'=>'Finnish','fr'=>'French','de'=>'German','el'=>'Greek',

The line : print "pat= "; prints the following and hangs:

pat=<,'hr'=>'Croatian','cs'=>'Czech','da'=>'Danish','nl'=>'Dutch','fi'=>'Finnish','fr'=>'French','de'=>'German','el'=>'Greek','hi'=>'Hindi','it'=>'Italian','ja'=>'Japanese','ko'=>'Korean','no'=>'Norwegian','pl'=>'Polish','pt'=>'Portuguese','ro'=>'Romanian','ru'=>'Russian','es'=>'Spanish','sv'=>'Swedish','ca'=>'Catalan','tl'=>'Filipino','iw'=>'Hebrew','id'=>'Indonesian','lv'=>'Latvian','lt'=>'Lithuanian','sr'=>'Serbian','sk'=>'Slovak','sl'=>'Slovenian','uk'=>'Ukrainian','vi'=>'Vietnamese','sq'=>'Albanian','et'=>'Estonian','gl'=>'Galician','hu'=>'Hungarian','mt'=>'Maltese','th'=>'Thai','tr'=>'Turkish','fa'=>'Persian','af'=>'Afrikaans','ms'=>'Malay','sw'=>'Swahili','ga'=>'Irish','cy'=>'Welsh','be'=>'Belarusian','is'=>'Icelandic','mk'=>'Macedonian','yi'=>'Yiddish','hy'=>'Armenian','az'=>'Azerbaijani','eu'=>'Basque','ka'=>'Georgian','ht'=>>

bdonlan · Accepted Answer

Perl regexes aren't particularly smart. Your pattern is implemented with backtracking - basically, at each decision point (eg, the unanchored start, or the {5,30}), the regex saves its state on a stack, and continues forward with the smallest possible match. If it gets stuck, it pops a level off the stack.

Usually this works pretty well, because the size of the input is limited, there are literal or character class matches in there to cut off failed matches early, and the use of nested variable repetition directives are limited. Additionally, when backtracking is not present, regexes can be converted (in theory, although perl does not do this) into a deterministic finite automaton, which can be executed efficiently.

Here, you have a lot of wildcard matches, a large input string, plus what's effectively a 50-iteration loop, with a 25-ply branch in the middle, with a loop around it to try to find the start and end of the match. Your regex might have to backtrack 25^50 times in the worst case, and that's without even considering the lack of a start or end anchor.

So you'll need to figure out a cleverer way to make this algorithm work. Try just generating every 4-character substring of the string, and counting occurrences. For each substring that appears more than once, examine the offsets of the matches found to see if you have your pattern.

Regex hangs for a pattern match

Answers (2)

Related Questions