Why does my Tcl regex perform so badly in comparison to Perl?

Question

set fr [open "x.txt" r]
set fw [open "y.txt" w]
set myRegex {^([0-9]+) ([0-9:]+\.[0-9]+).* ABC\.([a-zA-Z]+)\[([0-9]+)\] DEF\(([a-zA-Z]+)\) HIJ\(([0-9]+)\) KLM\(([0-9\.]+)\) NOP\(([0-9]+)\) QRS\(([0-9]+)\)}
while { [gets $fr line] >= 0 } {
   if { [regexp $myRegex $line match x y w z]} {
       if { [expr $z >> 32] == [lindex $argv 0]} {
         puts $fw "$x"
       }
   }
}
close $fr
close $fw

The Tcl code above takes forever (32s or more) to execute. Doing basically the same thing in Perl runs in 3s or less. I know that Perl performs better for some regexes, but is Tcl's performance really this bad in comparison? More than 10 times worse?

I'm using Tcl 8.4, by the way.

Here are the timings for the code above with the full regex and with reduced versions of the same regex:

32s is the time taken for the above code to execute
22s after removing: QRS\(([0-9]+)\)
17s after removing: NOP\(([0-9]+)\) QRS\(([0-9]+)\)
13s after removing: KLM\(([0-9\.]+)\) NOP\(([0-9]+)\) QRS\(([0-9]+)\)
9s  after removing: HIJ\(([0-9]+)\) KLM\(([0-9\.]+)\) NOP\(([0-9]+)\) QRS\(([0-9]+)\)
6s  after removing: DEF\(([a-zA-Z]+)\) HIJ\(([0-9]+)\) KLM\(([0-9\.]+)\) NOP\(([0-9]+)\) QRS\(([0-9]+)\)

Donal Fellows · Accepted Answer

The issue is that you have a lot of capturing and backtracking in that RE; that particular combination works poorly with the Tcl RE engine. The underlying cause is that Tcl uses a completely different type of RE engine from Perl: an automata-based matcher rather than a backtracking one. (It works better than Perl's for some other REs; this area is non-trivial.)

If you can, get rid of that early .* from the RE:

^([0-9]+) ([0-9:]+\.[0-9]+).* ABC\.([a-zA-Z]+)\[([0-9]+)\] DEF\(([a-zA-Z]+)\) HIJ\(([0-9]+)\) KLM\(([0-9\.]+)\) NOP\(([0-9]+)\) QRS\(([0-9]+)\)
                           ^^

That's the real cause of trouble. Replace it with something more exact, such as this:

(?:[^A]|A[^B]|AB[^C])*
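
As a quick sanity check, the sketch below compares the greedy `.*` with the narrower sub-pattern on a single line. The sample line is made up to fit the pattern (the real contents of x.txt are not shown in the question), and the variable names are illustrative only.

```tcl
# Hypothetical sample line; the real log format is assumed from the regex.
set line {1001 12:34:56.789 some filler text ABC.foo[8589934592] DEF(bar) HIJ(7) KLM(1.5) NOP(3) QRS(9)}

# Start of the original regex, with the greedy ".*" before " ABC."
set loose {^([0-9]+) ([0-9:]+\.[0-9]+).* ABC\.([a-zA-Z]+)\[([0-9]+)\]}

# Same start, but with the filler written as the more exact sub-pattern
set exact {^([0-9]+) ([0-9:]+\.[0-9]+)(?:[^A]|A[^B]|AB[^C])* ABC\.([a-zA-Z]+)\[([0-9]+)\]}

set looseOk [regexp $loose $line -> a1 b1 c1 d1]
set exactOk [regexp $exact $line -> a2 b2 c2 d2]

# Print the match flag and the captures so the two can be compared by eye
puts "loose: $looseOk ($a1 $b1 $c1 $d1)"
puts "exact: $exactOk ($a2 $b2 $c2 $d2)"
```

If a line could contain ABC more than once, the two patterns may settle on different occurrences, so it is worth repeating this check on real data before swapping the pattern in.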

Also, reduce the number of capturing groups in your RE to exactly those you need. You can probably convert the code overall to this:

set fr [open "x.txt" r]
set fw [open "y.txt" w]
set myRegex {^([0-9]+) (?:[0-9:]+\.[0-9]+)(?:[^A]|A[^B]|AB[^C])* ABC\.(?:[a-zA-Z]+)\[([0-9]+)\] DEF\((?:[a-zA-Z]+)\) HIJ\((?:[0-9]+)\) KLM\((?:[0-9\.]+)\) NOP\((?:[0-9]+)\) QRS\((?:[0-9]+)\)}
while { [gets $fr line] >= 0 } {
    # I've combined the [if]s and the [expr]
    if { [regexp $myRegex $line -> A D] && $D >> 32 == [lindex $argv 0]} {
        puts $fw "$A"
    }
}
# [close] takes one channel at a time
close $fr
close $fw
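
To see how much the rewrite helps on a given machine, one option is Tcl's built-in `time` command. The following is only a rough sketch: the sample line is invented to fit the pattern, the iteration count is arbitrary, and a single short line will not necessarily reproduce the gap seen on the real file.

```tcl
# Hypothetical sample line, made up to fit the pattern
set line {1001 12:34:56.789 some filler text ABC.foo[8589934592] DEF(bar) HIJ(7) KLM(1.5) NOP(3) QRS(9)}

# Regex from the question: greedy ".*" and nine capturing groups
set slow {^([0-9]+) ([0-9:]+\.[0-9]+).* ABC\.([a-zA-Z]+)\[([0-9]+)\] DEF\(([a-zA-Z]+)\) HIJ\(([0-9]+)\) KLM\(([0-9\.]+)\) NOP\(([0-9]+)\) QRS\(([0-9]+)\)}

# Rewritten regex: exact filler and only the two captures that are used
set fast {^([0-9]+) (?:[0-9:]+\.[0-9]+)(?:[^A]|A[^B]|AB[^C])* ABC\.(?:[a-zA-Z]+)\[([0-9]+)\] DEF\((?:[a-zA-Z]+)\) HIJ\((?:[0-9]+)\) KLM\((?:[0-9\.]+)\) NOP\((?:[0-9]+)\) QRS\((?:[0-9]+)\)}

# [time] runs the script n times and reports the average time per iteration
set n 1000
puts "question regex:  [time {regexp $slow $line -> x y w z} $n]"
puts "rewritten regex: [time {regexp $fast $line -> A D} $n]"
```

Timing against a representative slice of the real input gives a more trustworthy comparison than any single line.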

Note also that if { [expr ...] } is a suspicious code smell, as is any expression that is not braced. (It's sometimes necessary in very specific circumstances, but almost always indicates that code is over-complicated.)
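
A small sketch of the difference between braced and unbraced expressions follows; the values are made up for illustration. The `if` command already evaluates its condition as an expression, so wrapping part of it in `[expr ...]` adds nothing, and leaving an expression unbraced makes Tcl substitute the words before `expr` parses them a second time.

```tcl
# Hypothetical captured value, equal to 2 * 2**32
set D 8589934592

# Unbraced: Tcl substitutes $D first, then [expr] parses the resulting string
set unbraced [expr $D >> 32]

# Braced: [expr] receives the expression intact and can byte-compile it
set braced [expr {$D >> 32}]

puts "unbraced: $unbraced, braced: $braced"

# The double substitution becomes visible when a value looks like Tcl code
set s {[string length abc]}
puts [expr $s]     ;# unbraced: expr runs the embedded command
puts [expr {$s}]   ;# braced: the value is treated as plain data

# Inside [if], the condition is already an expression, so no [expr] is needed
if {$D >> 32 == 2} {
    puts "upper 32 bits are 2"
}
```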
