Reputation: 1769
I have a table with a few million rows. Currently, I'm working my way through them 10,000 at a time by doing this:
for (my $ival = 0; $ival < $c_count; $ival += 10000)
{
    my %record;
    my $qry = $dbh->prepare(
        "select * from big_table where address not like '%-XX%' limit $ival, 10000");
    $qry->execute();
    $qry->bind_columns( \( @record{ @{ $qry->{NAME_lc} } } ) );
    while (my $record = $qry->fetch) {
        this_is_where_the_magic_happens($record);
    }
}
I did some benchmarking and found that the prepare/execute part, while initially fast, slows down considerably after the first several 10,000-row batches. Is this a boneheaded way to write this? I just know that if I try to select everything in one go, the query takes forever.
Here's some snippets from the log:
(Thu Aug 21 12:51:59 2014) Processing records 0 to 10000
SQL Select => 1 wallclock secs ( 0.01 usr + 0.00 sys = 0.01 CPU)
(Thu Aug 21 12:52:13 2014) Processing records 10000 to 20000
SQL Select => 1 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
(Thu Aug 21 12:52:25 2014) Processing records 20000 to 30000
SQL Select => 2 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
(Thu Aug 21 12:52:40 2014) Processing records 30000 to 40000
SQL Select => 5 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
(Thu Aug 21 12:52:57 2014) Processing records 40000 to 50000
SQL Select => 13 wallclock secs ( 0.01 usr + 0.00 sys = 0.01 CPU)
...
(Thu Aug 21 14:33:19 2014) Processing records 650000 to 660000
SQL Select => 134 wallclock secs ( 0.01 usr + 0.00 sys = 0.01 CPU)
(Thu Aug 21 14:35:50 2014) Processing records 660000 to 670000
SQL Select => 138 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
(Thu Aug 21 14:38:27 2014) Processing records 670000 to 680000
SQL Select => 137 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
(Thu Aug 21 14:41:00 2014) Processing records 680000 to 690000
SQL Select => 134 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
Would it be faster to do this some other way? Should I remove the 'where' clause and just throw out the results I don't want in the loop?
Thanks for the help.
Upvotes: 1
Views: 2311
Reputation: 1110
Others have made useful suggestions. I'll just add a few thoughts that come to mind...
Firstly, see my old but still very relevant Advanced DBI Tutorial. Specifically page 80, which addresses paging through a large result set - a situation much like yours. It also covers profiling and fetchrow_hashref vs bind_columns.
Consider creating a temporary table with an auto-increment field, loading it with the data you want via an INSERT ... SELECT ... statement, then building/enabling an index on the auto-increment field (which will be faster than loading the data with the index already enabled), and then selecting ranges of rows from that temporary table using the key value. That will be very fast for fetching, but there's an up-front cost to build the temporary table.
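A rough sketch of how that might look (the work_rows table name and seq column are made up here, the question's WHERE clause is reused, and for brevity the index is created up front rather than after the load as suggested above):

$dbh->do(q{
    CREATE TEMPORARY TABLE work_rows (
        seq INT NOT NULL AUTO_INCREMENT PRIMARY KEY
    ) SELECT b.* FROM big_table b WHERE b.address NOT LIKE '%-XX%'
});

# Page through the work table by key range; each range is a simple index lookup.
my $page = $dbh->prepare('SELECT * FROM work_rows WHERE seq > ? AND seq <= ?');
for (my $start = 0; ; $start += 10_000) {
    $page->execute($start, $start + 10_000);
    my $seen = 0;
    while (my $row = $page->fetchrow_hashref) {
        this_is_where_the_magic_happens($row);
        $seen++;
    }
    last unless $seen;   # past the end of the work table
}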
Consider enabling mysql_use_result in DBD::mysql. Rather than load all the rows into memory within the driver, the driver will start to return rows to the application as they stream in from the server. This reduces latency and memory use but comes at the cost of holding a lock on the table.
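A minimal sketch of that, reusing the question's query (mysql_use_result can be set per statement handle as a prepare attribute, or on the $dbh itself - see the DBD::mysql docs):

my $sth = $dbh->prepare(
    "select * from big_table where address not like '%-XX%'",
    { mysql_use_result => 1 }    # stream rows from the server instead of buffering them all
);
$sth->execute();
while (my $row = $sth->fetchrow_hashref) {
    this_is_where_the_magic_happens($row);
}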
You could combine using mysql_use_result with my previous suggestion, but it might be simpler to combine it with using SELECT SQL_BUFFER_RESULT .... Both would avoid the lock problem (which might not be a problem for you anyway). Per the docs, SQL_BUFFER_RESULT "forces the result to be put into a temporary table". (Trivia: I think I suggested SQL_BUFFER_RESULT to Monty many moons ago.)
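For example (a sketch, again reusing the question's query), the hint simply goes into the statement text:

my $sth = $dbh->prepare(
    "select SQL_BUFFER_RESULT * from big_table where address not like '%-XX%'",
    { mysql_use_result => 1 }    # optional: stream rows from the server-side temporary table
);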
Upvotes: 5
Reputation: 22893
What @Oesor says is correct: you don't want to run multiple queries (unless you know you're the only one who can modify this table).
However, you have other issues.
You don't have an ORDER BY clause. Without one, your LIMIT is meaningless, since you won't necessarily get the rows back in the same order each time.
Consider using LIMIT n OFFSET m rather than LIMIT m, n - it's supported by PostgreSQL and is clearer to users of other database systems.
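For example (assuming an id column to order by), these two statements mean the same thing in MySQL:

my $portable = "select * from big_table order by id limit 10000 offset 30000";  # skip 30000 rows, return the next 10000
my $mysqlish = "select * from big_table order by id limit 30000, 10000";        # MySQL-only shorthand for the same thing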
Decide whether you are using bind_columns or returning a row reference - I can't see why you are trying to do both. Perhaps fetchrow_hashref is what you want.
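A sketch of the fetchrow_hashref style, keeping the question's query but dropping the LIMIT:

my $sth = $dbh->prepare("select * from big_table where address not like '%-XX%'");
$sth->execute();
while (my $record = $sth->fetchrow_hashref('NAME_lc')) {   # lowercase hash keys, like NAME_lc above
    this_is_where_the_magic_happens($record);              # e.g. $record->{address}
}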
Oh - be particularly careful about using bind_columns with SELECT *. What happens if you add a new column to the table? What if that new column is called ival?
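If you do stay with bind_columns, naming the columns explicitly avoids that trap; a sketch (the column names id, address and city are made up for illustration):

my $sth = $dbh->prepare("select id, address, city from big_table where address not like '%-XX%'");
$sth->execute();
my ($id, $address, $city);
$sth->bind_columns(\$id, \$address, \$city);
while ($sth->fetch) {
    # work with $id, $address and $city here
}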
OK, now let's look at what you're doing. It's not obvious, actually, since ...magic_happens isn't a terribly descriptive name. If it's doing updates, try to do the whole thing in the database. MySQL isn't as capable as PostgreSQL, but you're still better off doing things like batch updates within the RDBMS rather than shuffling large amounts of data back and forth.
If not, and you want to batch or "page" through the result set, then:
1. Order by the primary key (or some other unique column set).
2. Keep track of the final key value in each batch.
3. Use that value in a "greater than" test in the next query.
This lets the database use an index (if you have one) on the relevant unique columns and skip forward to row #30000 without having to read and discard 29,999 other rows first. A rough sketch of this approach is below.
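The sketch assumes an integer primary key named id (adjust to whatever your real key columns are):

my $sth = $dbh->prepare(q{
    select * from big_table
    where id > ? and address not like '%-XX%'
    order by id
    limit 10000
});
my $last_id = 0;
while (1) {
    $sth->execute($last_id);
    my $got = 0;
    while (my $record = $sth->fetchrow_hashref) {
        this_is_where_the_magic_happens($record);
        $last_id = $record->{id};   # remember the last key seen in this batch
        $got++;
    }
    last unless $got;   # no rows left
}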
Upvotes: 1
Reputation: 13792
According to your benchmarking numbers, the CPU times are very small, so you need to profile your DBI layer. Try running your code under DBI::Profile to collect those statistics.
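For example, profiling can be switched on without code changes via the DBI_PROFILE environment variable, or inside the script (a sketch; see the DBI::Profile docs for the available path specs):

# From the shell, with no code changes:
#   DBI_PROFILE='!Statement' perl your_script.pl
# Or inside the script, on the handle you want to measure:
use DBI::Profile;
$dbh->{Profile} = '!Statement';   # aggregate timings per SQL statement; a report prints at exit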
You probably need to define an index on your table to avoid a full scan for that query.
Upvotes: 0
Reputation: 6642
The problem is that you're running multiple queries. Your dataset may also change between queries, so you may miss rows or see duplicate rows; inserts or deletions among the rows you're searching will affect this.
The reason the first batches go fast is that the DB truncates the scan as soon as it hits 10,000 rows; it isn't fetching everything that matches your query, and so it runs faster. It's not 'getting slower' - it's just doing more and more of the same work, over and over: getting the first 10,000 rows, then the first 20,000 rows, then the first 30,000. You've written a Schlemiel the Painter's database query (http://www.joelonsoftware.com/articles/fog0000000319.html).
You should run the query without a limit and iterate over the resultset. This will ensure data integrity. You may also want to look into using where clauses that can take advantage of database indices to get a faster response to your query.
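In other words, something like this sketch, based on the question's own code with the outer loop and the LIMIT removed (and passing the bound hash rather than mixing it with the row reference fetch returns):

my %record;
my $qry = $dbh->prepare("select * from big_table where address not like '%-XX%'");
$qry->execute();
$qry->bind_columns( \( @record{ @{ $qry->{NAME_lc} } } ) );
while ($qry->fetch) {
    this_is_where_the_magic_happens(\%record);
}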
Upvotes: 4