Reputation: 20899
I'm working on a perl script that uses DBI to unload data from a database table to a specific format. I've got something working, but the performance is... lacking.
Here's the performance critical section of the code:
while (my $row = $query->fetchrow_arrayref()) {
    # Sanitize the columns to make sure certain characters are escaped with a backslash.
    # The escaping is required as some binary data may be included in some columns.
    # This must occur *before* the join() as $COLUMN_DELIM_STR may contain one of the special characters.
    for $col (@$row) { $col =~ s/(?=[\x5C\x00-\x1F])/\\/g; }

    # Output the sanitized row
    print join($COLUMN_DELIM_STR, @$row) . $RECORD_DELIM_STR;
}
I've got a test table with 5 columns and 10 million rows. The total unload time comes out to 90 seconds (the output is redirected to /dev/null so disk writes won't interfere with the benchmarks).
After experimenting with removing chunks of code to get an idea of how they affect performance, I came to the realization that the sanitization loop accounts for a large amount of the processing time, around 30 seconds (about 1/3 of the total execution time). Setting DBI_PROFILE=4 shows that the fetching itself takes around 45 seconds.
Here's the kicker: removing the actual substitution step ($col =~ s/(?=[\x5C\x00-\x1F])/\\/g;) only saves about 12 seconds of processing time. That means that the do-nothing for loop (for $col (@$row) { ; }) incurs 18 seconds of overhead, more than the substitution itself. (This was verified by removing the loop entirely.)
Summary:
Fetching the rows takes around 45 seconds.
The do-nothing for loop incurs around 18 seconds of overhead.
The substitution itself ($col =~ s/...//g;) takes 12 seconds for my test data.
Question:
How do I improve the performance of the sanitization step?
Bonus: Why is the for loop overhead so high?
Notes:
The sanitization itself simply puts a backslash before any of the special characters.
The sanitization is required and must be applied to every column before the join
takes place. This is a technical limitation because the $COLUMN_DELIM_STR
may contain special characters and we need them to not be escaped. In addition, the length and value of $COLUMN_DELIM_STR
may vary between runs of the script.
The number of columns can be determined in advance, but not the column names or data types. The script has no prior knowledge of which columns may or may not contain special characters that need escaping.
If there's a better way of sanitizing the column data, feel free to suggest it. I'm open to alternate ideas.
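To make the delimiter constraint concrete, here's a small sketch (the delimiter value \x1F is an assumption for illustration) showing why an unescaped special character inside a column would corrupt the record:

```perl
use strict;
use warnings;

my $COLUMN_DELIM_STR = "\x1F";    # assumed delimiter containing a special character
my @row = ("ab\x1Fcd", "ef");     # first column contains a raw delimiter byte

# Without escaping, the delimiter byte inside the column is
# indistinguishable from a real field boundary: split sees 3 fields.
my @fields = split /\x1F/, join($COLUMN_DELIM_STR, @row);

# With escaping applied before the join, embedded special characters
# are preceded by a backslash, so only unescaped delimiters mark boundaries.
s/(?=[\x5C\x00-\x1F])/\\/g for @row;
my $record = join($COLUMN_DELIM_STR, @row);
```

A consumer can then split only on delimiters not preceded by a backslash.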
Upvotes: 0
Views: 258
Reputation: 165198
If all you want to do is dump a table as a delimited file, let the database do it. MySQL has SELECT ... INTO OUTFILE; other databases have similar facilities. This avoids the overhead of copying all the data into your program, altering it, and spitting it out again.
Another option is to do the escaping in the SELECT. In Oracle you can use REGEXP_REPLACE. This should do it (though I may have got the details on the backslashes wrong):
REGEXP_REPLACE(column, '([^[:print:]])', '\\\\1')
The problem now is doing it to every column. You don't know how many columns you have or their names, but you can find out easily enough with SELECT * FROM table LIMIT 1 and $sth->fetchrow_hashref, or more directly with $dbh->column_info. Now you can construct a SELECT with the right number of columns and apply REGEXP_REPLACE to each of them. This may be faster. You could even do the join in the SELECT.
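A sketch of building such a SELECT in Perl (the table and column names here are placeholders; in practice they'd come from $dbh->column_info, and the exact Oracle backslash escaping may need adjusting):

```perl
use strict;
use warnings;

# Given a table name and its column names (e.g. from $dbh->column_info),
# build a SELECT that escapes non-printable characters in every column.
sub build_escaped_select {
    my ($table, @cols) = @_;
    my $bs    = "\\";                    # a single backslash
    my $repl  = "'$bs$bs$bs" . "1'";     # SQL literal '\\\1': escaped backslash + back-reference
    my @exprs = map { "REGEXP_REPLACE($_, '([^[:print:]])', $repl)" } @cols;
    return 'SELECT ' . join(', ', @exprs) . " FROM $table";
}

# Hypothetical table with two columns:
my $sql = build_escaped_select('unload_test', 'col1', 'col2');
```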
You could even write a PL/SQL function to do all this for you. This will likely be the most efficient. Here's an example of writing a string join function that can be adapted to also do the escaping.
As to why the empty loop is slow: you're running it 50 million times, though 18 seconds seems quite high; my 2011 Macbook Pro can run it in about 6 seconds. Let's verify the empty loop is the problem. How long does this code take?
time perl -wle 'my $rows = [1..5]; for my $row (1..10_000_000) { for $col (@$rows) {} }'
Simply iterating 50 million times (for (1..50_000_000)) takes a third of the time, so maybe there's a way to micro-optimize the inner loop. I'll spare you the experimentation: it turns out a map in void context with no block is significantly faster.
map s{(?=[\x5C\x00-\x1F])}{\\}g, @$rows;
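Before switching, a quick sanity check (a sketch) that the block-less map escapes exactly like the original for loop:

```perl
use strict;
use warnings;

my @for_rows = ( "abc", "d\x00e", "f\\g" );
my @map_rows = @for_rows;

# Original approach: aliasing for loop.
for my $col (@for_rows) { $col =~ s/(?=[\x5C\x00-\x1F])/\\/g; }

# Block-less map in void context, modifying the array in place.
map s/(?=[\x5C\x00-\x1F])/\\/g, @map_rows;

# Both arrays now hold the same escaped values.
```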
Why? Dumping the bytecode with B::Terse tells us Perl is doing less work in the map. Here's what the inner for loop is doing:
UNOP (0x1234567890ab) null
LOGOP (0x1234567890ab) and
OP (0x1234567890ab) iter
LISTOP (0x1234567890ab) lineseq
COP (0x1234567890ab) nextstate
BINOP (0x1234567890ab) leaveloop
LOOP (0x1234567890ab) enteriter
OP (0x1234567890ab) null [3]
UNOP (0x1234567890ab) null [147]
OP (0x1234567890ab) pushmark
UNOP (0x1234567890ab) rv2av [7]
OP (0x1234567890ab) padsv [1]
PADOP (0x1234567890ab) gv GV (0x1234567890ab) *_
UNOP (0x1234567890ab) null
LOGOP (0x1234567890ab) and
OP (0x1234567890ab) iter
LISTOP (0x1234567890ab) lineseq
COP (0x1234567890ab) nextstate
PMOP (0x1234567890ab) subst
SVOP (0x1234567890ab) const [12] PV (0x1234567890ab) "2"
OP (0x1234567890ab) unstack
OP (0x1234567890ab) unstack
and here's the map.
UNOP (0x1234567890ab) null
LOGOP (0x1234567890ab) and
OP (0x1234567890ab) iter
LISTOP (0x1234567890ab) lineseq
COP (0x1234567890ab) nextstate
LOGOP (0x1234567890ab) mapwhile [8]
LISTOP (0x1234567890ab) mapstart
OP (0x1234567890ab) pushmark
UNOP (0x1234567890ab) null
PMOP (0x1234567890ab) subst
SVOP (0x1234567890ab) const [12] PV (0x1234567890ab) "2"
UNOP (0x1234567890ab) rv2av [7]
OP (0x1234567890ab) padsv [1]
OP (0x1234567890ab) unstack
Basically, the for loop has to go through the extra work of setting up a new lexical context for every iteration. The map does not, but you can't use a block. Interestingly, s/1/2/ for @$rows compiles nearly the same as for (@$rows) { s/1/2/ }.
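Applied to the question's loop, the postfix form looks like this sketch (sample rows stand in for the DBI handle, and the delimiter values are assumptions for illustration):

```perl
use strict;
use warnings;

my $COLUMN_DELIM_STR = "\x1F";   # assumed delimiter for illustration
my $RECORD_DELIM_STR = "\n";

# Stand-in for rows from $query->fetchrow_arrayref().
my @sample_rows = ( [ "abc", "d\x00e" ], [ "f\\g", "hij" ] );
my $out = '';

for my $row (@sample_rows) {
    s/(?=[\x5C\x00-\x1F])/\\/g for @$row;   # postfix for: escape each column in place
    $out .= join($COLUMN_DELIM_STR, @$row) . $RECORD_DELIM_STR;
}

print $out;
```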
Upvotes: 5
Reputation: 386361
For me, the test harness plus the loop takes 0.960 µs, plus 0.141 µs per element; looping over 5 elements thus comes to about 1.66 µs. The numbers can vary in practice, but the ratio is much more in line with my expectations than what you claim: performing a regex-based substitution is rather costly, but incrementing a counter is not, so the loop should be much cheaper than the substitution.
use strict;
use warnings;
use Benchmark qw( timethese );
my %tests = (
    'for'  => 'my $_col = our $col; our $row; for my $col (@$row) { }',
    's///' => 'my $_col = our $col; $_col =~ s/(?=[\\x5C\\x00-\\x1F])/\\\\/g;',
);

$_ = 'use strict; use warnings; ' . $_ for values %tests;

{
    local our $row = [ ('a') x 1000 ];
    local our $col = "abc\x00def";
    timethese(-3, \%tests);
}

{
    local our $row = [];
    local our $col = "abc\x00def";
    timethese(-3, \%tests);
}
Output:
for (1000 elements): 7065.42/s
for (0 elements): 1041030.65/s
s///: 284348.25/s
Upvotes: 1