Jason Swett

Reputation: 45134

How can I do this faster?

I have a script that imports CSV files. What ends up in my database is, among other things, a list of customers and a list of addresses. I have a table called customer and another called address, where address has a customer_id.

One thing that's important to me is not to have any duplicate rows. Therefore, each time I import an address, I do something like this:

$address = new Address();
$address->setLine_1($line_1);
$address->setZip($zip);
$address->setCountry($usa);
$address->setCity($city);
$address->setState($state);
$address = Doctrine::getTable('Address')->findOrCreate($address);
$address->save();

What findOrCreate() does, as you can probably guess, is return the matching address record if one exists, and otherwise just return the Address object it was given. Here is the code:

  public function findOrCreate($address)
  {
    $q = Doctrine_Query::create()
      ->select('a.*')
      ->from('Address a')
      ->where('a.line_1 = ?', $address->getLine_1())
      ->andWhere('a.line_2 = ?', $address->getLine_2())
      ->andWhere('a.country_id = ?', $address->getCountryId())
      ->andWhere('a.city = ?', $address->getCity())
      ->andWhere('a.state_id = ?', $address->getStateId())
      ->andWhere('a.zip = ?', $address->getZip());

    $existing_address = $q->fetchOne();

    if ($existing_address)
    {
      return $existing_address;
    }
    else
    {
      return $address;
    }
  }

The problem with this approach is that it's slow. Saving each row of the CSV file (which translates into several INSERT statements on different tables) takes about a quarter of a second. I'd like to get as close to "instantaneous" as possible, because I sometimes have over 50,000 rows in a CSV file. I've found that if I comment out the part of my import that saves addresses, it's much faster. Is there some faster way I could do this? I briefly considered adding an index, but it seems like an index wouldn't help, since all the fields need to match.

Upvotes: 0

Views: 265

Answers (4)

Jason Swett

Reputation: 45134

What I ended up doing, which improved performance greatly, was to use MySQL's ON DUPLICATE KEY UPDATE instead of findOrCreate().
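Roughly, assuming you add a unique index spanning the six address columns (the index, the raw-SQL call through the Doctrine connection, and the LAST_INSERT_ID(id) idiom below are illustrative assumptions, not a copy of my import code):

// One-time setup, something like:
//   ALTER TABLE address ADD UNIQUE uniq_address
//     (line_1, line_2, country_id, city, state_id, zip);
// Note: the indexed columns should be NOT NULL, because MySQL treats
// NULLs in a unique index as distinct from one another.
$conn = Doctrine_Manager::connection();

// When the row already exists, the update branch fires instead of an
// INSERT, and LAST_INSERT_ID(id) makes lastInsertId() return the
// existing row's id rather than a new one.
$conn->execute(
  'INSERT INTO address (line_1, line_2, country_id, city, state_id, zip)
   VALUES (?, ?, ?, ?, ?, ?)
   ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id)',
  array($line_1, $line_2, $country_id, $city, $state_id, $zip)
);
$address_id = $conn->getDbh()->lastInsertId();

This collapses the SELECT-then-INSERT round trip into a single statement per row, which is where most of the savings came from.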

Upvotes: 0

mway

Reputation: 4392

This certainly won't alleviate all of the time spent on tens of thousands of iterations, but why don't you manage your addresses outside of per-iteration DB queries? The general idea:

  1. Get a list of all the current addresses and store it in an array.
  2. As you iterate, check array membership (a checksum of the address fields makes a compact key); if the address isn't already there, add it to the array and save it to the database, as in the sketch below.

Unless I'm misunderstanding the scenario, this way you're only making INSERT queries if you have to, and you don't need to perform any SELECT queries aside from the first one.
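A rough sketch of that approach (the md5-of-serialized-fields key is just one convenient way to build the checksum; adapt the accessors to your schema):

// Build a signature => id map of every address already in the database.
$seen = array();
foreach (Doctrine::getTable('Address')->findAll() as $existing)
{
  $key = md5(serialize(array(
    $existing->getLine_1(),
    $existing->getLine_2(),
    $existing->getCountryId(),
    $existing->getCity(),
    $existing->getStateId(),
    $existing->getZip(),
  )));
  $seen[$key] = $existing->getId();
}

// Then, inside the CSV loop:
$key = md5(serialize(array($line_1, $line_2, $country_id, $city, $state_id, $zip)));
if (!isset($seen[$key]))
{
  $address = new Address();
  // ... set the fields as in the question ...
  $address->save();
  $seen[$key] = $address->getId();
}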

Upvotes: 1

Alan Geleynse

Reputation: 25139

It looks like your duplicate checking is what is slowing you down. To find out why, figure out what query Doctrine is creating and run EXPLAIN on it.

My guess would be that you will need to create some indexes. Searching through the entire table can be very slow, but adding an index on zip would let the query search only the addresses with that zip code. The EXPLAIN output will be able to guide you to other optimizations.
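For instance (idx_zip and the placeholder parameters are made up for illustration):

$conn = Doctrine_Manager::connection();

// One-time: index the most selective column so the query only has to
// examine rows sharing that zip instead of scanning the whole table.
$conn->execute('ALTER TABLE address ADD INDEX idx_zip (zip)');

// Inspect the plan for the query findOrCreate() generates.
$plan = $conn->fetchAll(
  'EXPLAIN SELECT a.* FROM address a
   WHERE a.line_1 = ? AND a.line_2 = ? AND a.country_id = ?
     AND a.city = ? AND a.state_id = ? AND a.zip = ?',
  array($line_1, $line_2, $country_id, $city, $state_id, $zip)
);
print_r($plan);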

Upvotes: 0

Ike Walker

Reputation: 65577

I recommend that you investigate loading the CSV files into MySQL using LOAD DATA INFILE:

http://dev.mysql.com/doc/refman/5.1/en/load-data.html

In order to update existing rows, you have a couple of options. LOAD DATA INFILE does not have upsert functionality (INSERT ... ON DUPLICATE KEY UPDATE), but it does have a REPLACE option, which you could use to update existing rows. You need to make sure you have an appropriate unique index, though, and keep in mind that REPLACE is really just a DELETE plus an INSERT, which is slower than an UPDATE.
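A sketch of the REPLACE variant, run through the same Doctrine connection the importer already uses (the file path and CSV layout here are hypothetical):

$conn = Doctrine_Manager::connection();

// REPLACE rewrites any row that collides with the unique index
// (internally a DELETE followed by an INSERT).
// LOCAL requires local-infile support on both client and server.
$conn->execute(
  "LOAD DATA LOCAL INFILE '/path/to/addresses.csv'
   REPLACE INTO TABLE address
   FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
   LINES TERMINATED BY '\\n'
   IGNORE 1 LINES
   (line_1, line_2, country_id, city, state_id, zip)"
);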

Another option is to load the data from the CSV into a temporary table, then merge that table with the live table using INSERT ... ON DUPLICATE KEY UPDATE. Again, make sure you have an appropriate unique index, but in this case you're doing an update instead of a delete, so it should be faster.
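And a sketch of the staging-table variant (again with a hypothetical file path and column list); with an exact-duplicate unique key, the update branch writes the same value back, so existing rows are effectively left untouched:

$conn = Doctrine_Manager::connection();

// Stage the raw CSV rows in a scratch copy of the live table.
$conn->execute('CREATE TEMPORARY TABLE address_staging LIKE address');
$conn->execute(
  "LOAD DATA LOCAL INFILE '/path/to/addresses.csv'
   INTO TABLE address_staging
   FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
   LINES TERMINATED BY '\\n'
   IGNORE 1 LINES
   (line_1, line_2, country_id, city, state_id, zip)"
);

// Merge into the live table; duplicates hit the unique index and take
// the update branch instead of being inserted a second time.
$conn->execute(
  'INSERT INTO address (line_1, line_2, country_id, city, state_id, zip)
   SELECT line_1, line_2, country_id, city, state_id, zip
   FROM address_staging
   ON DUPLICATE KEY UPDATE city = VALUES(city)'
);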

Upvotes: 1
