Mr. Llama

Reputation: 20899

Perl DBI - For loop killing performance?

I'm working on a Perl script that uses DBI to unload data from a database table into a specific format. I've got something working, but the performance is... lacking.

Here's the performance critical section of the code:

while (my $row = $query->fetchrow_arrayref()) {
    # Sanitize the columns to make sure certain characters are escaped with a backslash.
    # The escaping is required as some binary data may be included in some columns.
    # This must occur *before* the join() as $COLUMN_DELIM_STR may contain one of the special characters.
    for $col (@$row) { $col =~ s/(?=[\x5C\x00-\x1F])/\\/g; }

    # Output the sanitized row
    print join($COLUMN_DELIM_STR, @$row) . $RECORD_DELIM_STR;
}

I've got a test table with 5 columns and 10 million rows. The total unload time comes out to 90 seconds (the output is redirected to /dev/null so disk writes won't interfere with the benchmarks).

After experimenting with removing chunks of code to get an idea of how they affect performance, I came to the realization that the sanitization loop accounts for a large amount of processing time, around 30 seconds (about a third of the total execution time). Setting DBI_PROFILE=4 shows that the fetching itself takes around 45 seconds.

Here's the kicker: removing the actual substitution step ($col =~ s/(?=[\x5C\x00-\x1F])/\\/g;) only saves about 12 seconds of processing time. That means that the do-nothing for loop (for $col (@$row) { ; }) incurs 18 seconds of overhead, more than the substitution itself. (This was verified by removing the loop entirely.)

Summary:

  • Total unload time: ~90 seconds (5 columns, 10 million rows, output to /dev/null)
  • Fetching (per DBI_PROFILE=4): ~45 seconds
  • Sanitization loop: ~30 seconds, of which the substitution itself accounts for only ~12 seconds
  • Overhead of the do-nothing loop: ~18 seconds

Question:

How do I improve the performance of the sanitization step?
Bonus: Why is the for loop overhead so high?

Upvotes: 0

Views: 258

Answers (2)

Schwern

Reputation: 165198

If all you want to do is dump a table as a delimited file, let the database do it. MySQL has SELECT ... INTO OUTFILE; other databases have similar facilities. This avoids the overhead of copying all the data into your program, altering it, and spitting it back out again. A rough sketch is below.
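A minimal sketch of that approach, assuming MySQL (the DSN, table name, and output path are placeholders; note that INTO OUTFILE writes the file on the database server's filesystem, not the client's):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'pass', { RaiseError => 1 });

# A single-quoted heredoc so the backslashes reach MySQL exactly as written.
$dbh->do(<<'SQL');
SELECT *
INTO OUTFILE '/tmp/my_table.dump'
  FIELDS TERMINATED BY ',' ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
FROM my_table
SQL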


Another option is to do the escaping in the SELECT. In Oracle you can use REGEXP_REPLACE. This should do it (I may have got the details on the backslashes wrong).

REGEXP_REPLACE(column, '([^[:print:]])', '\\\\1')

The problem now is doing it to every column. You don't know how many columns you have or their names, but you can find out easily enough with SELECT * FROM table LIMIT 1 and $sth->fetchrow_hashref, or more directly with $dbh->column_info. Now you can construct a SELECT with the right number of columns and apply REGEXP_REPLACE to each of them. This may be faster. You could even do the join in the SELECT. A rough sketch follows.
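Assuming Oracle, with the connection details and the table name MY_TABLE as placeholders (and with the caveat above about the backslashes), it could look something like this:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'pass', { RaiseError => 1 });

# column_info() returns one row per column; COLUMN_NAME is a standard field.
my $info    = $dbh->column_info(undef, undef, 'MY_TABLE', '%');
my @columns = map { $_->{COLUMN_NAME} } @{ $info->fetchall_arrayref({}) };

# Wrap every column in REGEXP_REPLACE so the escaping happens in the database.
# "\\\\\\\\1" in the Perl source becomes \\\\1 in the SQL text, matching the
# snippet above.
my $select_list = join ', ',
    map { "REGEXP_REPLACE($_, '([^[:print:]])', '\\\\\\\\1')" } @columns;

my $sth = $dbh->prepare("SELECT $select_list FROM MY_TABLE");
$sth->execute;

while (my $row = $sth->fetchrow_arrayref) {
    print join("\t", @$row), "\n";    # delimiters are placeholders
}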

You could even write a PL/SQL function to do all this for you. This will likely be the most efficient. Here's an example of writing a string join function that can be adapted to also do the escaping.


As to why the empty loop is slow: you're running it 50 million times, though 18 seconds still seems quite high. My 2011 MacBook Pro can run it in about 6 seconds. Let's verify the empty loop is the problem: how long does this code take?

time perl -wle 'my $rows = [1..5]; for my $row (1..10_000_000) { for $col (@$rows) {} }'

Simply iterating 50 million times (for (1..50_000_000)) takes a third of the time, so maybe there's a way to micro-optimize the inner loop. I'll spare you the experimentation: it turns out a map in void context with no block is significantly faster.

map s{(?=[\x5C\x00-\x1F])}{\\}g, @$rows;

Why? Dumping the op tree with B::Terse tells us Perl is doing less work in the map. Here's what the inner for loop is doing:

    UNOP (0x1234567890ab) null 
        LOGOP (0x1234567890ab) and 
            OP (0x1234567890ab) iter 
            LISTOP (0x1234567890ab) lineseq 
                COP (0x1234567890ab) nextstate 
                BINOP (0x1234567890ab) leaveloop 
                    LOOP (0x1234567890ab) enteriter 
                        OP (0x1234567890ab) null [3] 
                        UNOP (0x1234567890ab) null [147] 
                            OP (0x1234567890ab) pushmark 
                            UNOP (0x1234567890ab) rv2av [7] 
                                OP (0x1234567890ab) padsv [1] 
                        PADOP (0x1234567890ab) gv  GV (0x1234567890ab) *_ 
                    UNOP (0x1234567890ab) null 
                        LOGOP (0x1234567890ab) and 
                            OP (0x1234567890ab) iter 
                            LISTOP (0x1234567890ab) lineseq 
                                COP (0x1234567890ab) nextstate 
                                PMOP (0x1234567890ab) subst 
                                    SVOP (0x1234567890ab) const [12] PV (0x1234567890ab) "2" 
                                OP (0x1234567890ab) unstack 
                OP (0x1234567890ab) unstack 

and here's the map.

    UNOP (0x1234567890ab) null 
        LOGOP (0x1234567890ab) and 
            OP (0x1234567890ab) iter 
            LISTOP (0x1234567890ab) lineseq 
                COP (0x1234567890ab) nextstate 
                LOGOP (0x1234567890ab) mapwhile [8] 
                    LISTOP (0x1234567890ab) mapstart 
                        OP (0x1234567890ab) pushmark 
                        UNOP (0x1234567890ab) null 
                            PMOP (0x1234567890ab) subst 
                                SVOP (0x1234567890ab) const [12] PV (0x1234567890ab) "2" 
                        UNOP (0x1234567890ab) rv2av [7] 
                            OP (0x1234567890ab) padsv [1] 
                OP (0x1234567890ab) unstack 

Basically, the for loop has to go through the extra work of setting up a new lexical context for every iteration. The map does not, but only so long as you avoid the block form. Interestingly, s/1/2/ for @$rows compiles to nearly the same ops as for (@$rows) { s/1/2/ }.
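If you want to check this on your own perl, here's a quick Benchmark sketch comparing the three spellings. The test strings and element count are made up, and the absolute numbers (and possibly even the ranking) will differ between machines and perl versions:

use strict;
use warnings;
use Benchmark qw( cmpthese );

my @template = ("abc\x00def") x 5;    # 5 columns, one byte needing escaping

cmpthese(-3, {
    for_block => sub {
        my @row = @template;          # fresh copy, so each run escapes the same input
        for my $col (@row) { $col =~ s/(?=[\x5C\x00-\x1F])/\\/g }
    },
    map_void => sub {
        my @row = @template;
        map s/(?=[\x5C\x00-\x1F])/\\/g, @row;    # EXPR form, no block
    },
    postfix_for => sub {
        my @row = @template;
        s/(?=[\x5C\x00-\x1F])/\\/g for @row;
    },
});

The copy of @template at the top of each sub keeps the comparison fair: without it, every pass would also escape the backslash added by the previous pass.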

Upvotes: 5

ikegami

Reputation: 386361

For me,

  • The test harness plus the substitution takes 3.57 µs per element (for seven-character strings with one character in need of escaping).
  • The test harness plus the loop takes 0.960 µs, plus 0.141 µs per element.
  • Looping over 5 elements thus comes to about 1.66 µs (0.960 + 5 × 0.141).

The numbers can vary in practice, but the ratio is much more in line with my expectations than what you claim. Performing a regex-based substitution is rather costly, but incrementing a counter is not, so the loop should be much cheaper than the substitution.


use strict;
use warnings;

use Benchmark qw( timethese );

my %tests = (
   'for'  => 'my $_col = our $col; our $row; for my $col (@$row) { }',
   's///' => 'my $_col = our $col; $_col =~ s/(?=[\\x5C\\x00-\\x1F])/\\\\/g;',
);

$_ = 'use strict; use warnings; '.$_ for values %tests;

{
   local our $row = [('a')x1000];
   local our $col = "abc\x00def";
   timethese(-3, \%tests);
}
{
   local our $row = [];
   local our $col = "abc\x00def";
   timethese(-3, \%tests);
}

Output:

  • for (1000 elements): 7065.42/s
  • for (0 elements): 1041030.65/s
  • s///: 284348.25/s

Upvotes: 1
