Jean

Reputation: 22745

Improve Python code performance

How do I improve the performance of this simple piece of Python code? Isn't re.search the best way to find a matching line? It is almost 6x slower than Perl; am I doing something wrong?

#!/usr/bin/env python

import re
import time
import sys

i = 0  # total lines read
j = 0  # lines matching the pattern
time1 = time.time()
base_register = r'DramBaseAddress\d+'
for line in open('rndcfg.cfg'):
    i += 1
    if re.search(base_register, line):
        j += 1
time2 = time.time()

print(i, j)
print(time2 - time1)
print(sys.version)

This code takes about 0.96 seconds to complete (average of 10 runs).
Output:

168197 2688
0.8597519397735596
3.3.2 (default, Sep 24 2013, 15:14:17)
[GCC 4.1.1]

while the following Perl code does it in 0.15 seconds.

#!/usr/bin/env perl
use strict;
use warnings;

use Time::HiRes qw(time);

my $i = 0;
my $j = 0;
my $time1 = time;
open(my $fp, '<', 'rndcfg.cfg') or die "cannot open rndcfg.cfg: $!";
while(<$fp>)
{
    $i++;
    if(/DramBaseAddress\d+/)
    {
        $j++;
    }
}
close($fp);
my $time2=time;

printf("%d,%d\n",$i,$j);
printf("%f\n",$time2-$time1);
printf("%s\n",$]);


Output:

168197,2688
0.135579
5.012001

EDIT: Corrected the regular expression, which worsened performance slightly.

Upvotes: 3

Views: 236

Answers (3)

Veedrac

Reputation: 60207

The overhead of calling re.compile, which re.search does on every call despite the pattern caching, is massive. Use

import re

is_wanted_line = re.compile(r"DramBaseAddress\d+").search

j = 0
for i, line in enumerate(open('rndcfg.cfg')):
    if is_wanted_line(line):
        j += 1

instead.

Further, you can do

key = "DramBaseAddress"
is_wanted_line = re.compile(r"DramBaseAddress\d+").search

for i, line in enumerate(open('rndcfg.cfg')):
    if key in line and is_wanted_line(line):
        j += 1

to further reduce overhead.

You can also consider doing your own buffering:

key = b"DramBaseAddress"
is_wanted_line = re.compile(rb"DramBaseAddress\d+").search

with open("rndcfg.cfg", "rb") as file:
    rest = b""

    for chunk in iter(lambda: file.read(32768), b""):
        i += chunk.count(b"\n")
        chunk, _, rest = (rest + chunk).rpartition(b"\n")

        if key in rest and is_wanted_line(chunk):
            j += 1

    if key in rest and is_wanted_line(rest):
        j += 1

which removes the line-splitting and encoding overhead. (This isn't quite the same, as it counts at most one match per chunk. Handling multiple matches per chunk is relatively simple to add, as in the sketch at the end of this answer, but may not strictly be needed in your case.)

This is a bit heavyweight, but three times as fast as the Perl, and 8x as fast if you remove i += chunk.count(b"\n")!
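
For completeness, here is a minimal sketch of the multiple-matches-per-chunk variant mentioned above. It is not part of the original answer, and it counts every occurrence of the pattern rather than matching lines (the totals agree whenever each matching line contains exactly one occurrence):

import re

# Count every occurrence per chunk with findall, instead of at most one.
count_matches = re.compile(rb"DramBaseAddress\d+").findall

i = j = 0
with open("rndcfg.cfg", "rb") as file:
    rest = b""

    for chunk in iter(lambda: file.read(32768), b""):
        i += chunk.count(b"\n")
        # Carry the trailing partial line over so a match straddling a
        # chunk boundary is not missed or split.
        chunk, _, rest = (rest + chunk).rpartition(b"\n")
        j += len(count_matches(chunk))

    j += len(count_matches(rest))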

Upvotes: 1

oxymor0n

Reputation: 1097

Actually, regular expressions are less efficient than the string methods in Python. From https://docs.python.org/2/howto/regex.html#use-string-methods:

Strings have several methods for performing operations with fixed strings and they’re usually much faster, because the implementation is a single small C loop that’s been optimized for the purpose, instead of the large, more generalized regular expression engine.

Replacing re.search with str.find will give you a better runtime. Otherwise, using the in operator that others suggested would be optimized, too.
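
A rough sketch of that approach (an illustration, not from the original answer; note the fixed-string test drops the \d+ part of the pattern, so it assumes every occurrence of the key is followed by digits):

j = 0
with open('rndcfg.cfg') as fp:
    for line in fp:
        # Fixed-string membership test; much cheaper than a regex call.
        if "DramBaseAddress" in line:
            j += 1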

As for the speed difference between the Python and Perl versions, I'll just chalk it up to the inherent quality of each language: text processing - python vs perl performance

Upvotes: 5

Peter Tillemans

Reputation: 35341

In this case you are using a fixed string, not a regular expression.

For regular strings there are faster methods:

>>> import timeit
>>> timeit.timeit('re.search(regexp, "banana")', setup="import re; regexp=r'nan'")
1.2156920433044434
>>> timeit.timeit('"banana".index("nan")')
0.23752403259277344
>>> timeit.timeit('"banana".find("nan")')
0.2411658763885498
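
For a fairer comparison you could also time a precompiled pattern, which avoids the per-call cache lookup (a sketch; numbers are omitted because they vary by machine):

>>> timeit.timeit('pattern.search("banana")', setup="import re; pattern = re.compile(r'nan')")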

Now, this kind of text processing is the sweet spot of Perl (aka the Practical Extraction and Reporting Language, aka the Pathological Eclectic Rubbish Lister), which has been optimized extensively over the years. All that collective focus adds up.

Upvotes: 1
