Steve

Reputation: 3095

MySQL update query low performance - are my indexes wrong?

I have two tables - a temporary table for working with imported data, and the existing accounts table. I need to update the user id in the temp table based on matching by an account number and group number in the accounts table.

CREATE TABLE `_temp` (
 `g_id` int(11) NOT NULL AUTO_INCREMENT,
 `g_group_norm` varchar(10) DEFAULT NULL,
 `g_uid1` varchar(25) DEFAULT NULL,
 `g_spid` int(11) DEFAULT NULL,
 `g_pid` int(11) DEFAULT NULL,
PRIMARY KEY (`g_id`),
 KEY `groupn` (`g_group_norm`) USING BTREE,
 KEY `guid` (`g_uid1`) USING BTREE,
 KEY `gspid` (`g_spid`) USING BTREE
 ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

 CREATE TABLE `accounts` (
`sa_actid` int(11) NOT NULL AUTO_INCREMENT,
`sa_grp` varchar(10) DEFAULT NULL,
`sa_account` varchar(25) DEFAULT NULL,
`sa_spid` int(11) DEFAULT NULL,
`sa_partnerid` int(11) DEFAULT NULL,
 PRIMARY KEY (`sa_actid`),
 KEY `spid` (`sa_spid`) USING BTREE,
 KEY `grp` (`sa_grp`) USING BTREE,
 KEY `act` (`sa_account`) USING BTREE,
 KEY `partnerid` (`sa_partnerid`) USING BTREE
 ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

The accounts table is upwards of 5M rows. The _temp table data I'm working with now can be anywhere from 50,000 to 700,000 rows.

The query I've been using to set g_spid from sa_spid is this:

UPDATE _temp T, accounts A SET
  T.g_spid = A.sa_spid
 WHERE T.g_uid1 = A.sa_account
   AND T.g_group_norm = A.sa_grp
   AND A.sa_partnerid = 118
   AND T.g_spid IS NULL;

The accounts table has around 3M rows, of which 2.84M have partner ID "118". The temp table update is slow but tolerable on 10-20k rows, but with a larger set of data (the temp table is currently around 100k rows) it seems to never finish (it has been running for 15 minutes now).

Is there a better way to do this query? I've tried it with an inner join on g_uid1 = sa_account, g_group_norm = sa_grp etc. and that seems even slower.

Upvotes: 1

Views: 36

Answers (2)

spencer7593

Reputation: 108400

An equivalent result could be achieved with a correlated subquery:

 UPDATE _temp t
    SET t.g_spid =  ( SELECT MIN(a.sa_spid) 
                        FROM accounts a
                       WHERE a.sa_account   = t.g_uid1
                         AND a.sa_grp       = t.g_group_norm
                         AND a.sa_partnerid = 118
                    )
  WHERE t.g_spid       IS NULL
    AND t.g_uid1       IS NOT NULL
    AND t.g_group_norm IS NOT NULL

That correlated subquery is going to be executed for every row processed by the outer query, so for performance we want a suitable index, preferably a covering index.

With the equality conditions in the WHERE clause of the correlated subquery, we want those three columns first in the index, with the most selective column first. (With nearly 95% of the rows in accounts having an sa_partnerid value of 118, that column isn't very selective, so we put it third.)

ON accounts (sa_account, sa_grp, sa_partnerid, sa_spid)

We also include the sa_spid column, to make it a "covering" index, so that the subquery can be satisfied entirely from the index, without a need to look up pages in the underlying table.
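Spelled out as DDL, that covering index could be created like this (the index name is illustrative):

```sql
-- Covering index for the correlated subquery:
-- equality columns first, sa_spid appended to cover the SELECT list.
CREATE INDEX acct_account_grp_partner_spid
    ON accounts (sa_account, sa_grp, sa_partnerid, sa_spid);
```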

(The indexes on the singleton columns might be useful for other queries, but they aren't suitable for this particular query.)

We might also add an index on the _temp table, if the conditions in the WHERE clause are selective enough. If we need to look at more than 10% or 15% of the rows in _temp, a full scan will likely be faster.
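As a sketch only, an index matching the outer query's predicates might look like this (the name is illustrative, and it is only worthwhile if few rows satisfy g_spid IS NULL):

```sql
-- Hypothetical index matching the outer query's predicates;
-- only helpful if the g_spid IS NULL condition is selective.
CREATE INDEX temp_spid_grp_uid
    ON _temp (g_spid, g_group_norm, g_uid1);
```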

With each row updated, there will be overhead for maintaining the index on the g_spid column. For large sets, it's sometimes faster to drop the index, perform the update, and then re-add the index.
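Using the index names from the posted DDL, that drop/update/re-add pattern would look something like:

```sql
-- Drop the secondary index on g_spid so the bulk update
-- doesn't maintain it row by row...
ALTER TABLE _temp DROP INDEX gspid;

-- ... run the UPDATE here ...

-- ... then rebuild the index once, after the update completes.
ALTER TABLE _temp ADD INDEX gspid (g_spid) USING BTREE;
```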

(I suspect there are better choices for indexes on the _temp table, but that can't really be determined without knowing the other SQL being executed against the table. None of the indexes on the _temp table look to be suitable for this query, unless the conditions in the WHERE clause of the outer query are very selective.)


For a large number of rows in _temp, we might want to break the operation into smaller sets, using the exact same query pattern but adding another condition to the outer query to limit each run to a slice of the rows.

As an example (I don't know the datatypes of the columns, the ranges of values, or the distribution; this is just an illustration of the idea): assuming g_group_norm is a DECIMAL value that ranges from 0.00000 to 0.99999 with an even distribution, we could break the UPDATE into ten "sets"...

first run

  AND t.g_group_norm >= 0.0 AND t.g_group_norm < 0.1

second run

  AND t.g_group_norm >= 0.1 AND t.g_group_norm < 0.2

third run

  AND t.g_group_norm >= 0.2 AND t.g_group_norm < 0.3
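Putting one such batch together with the correlated subquery from earlier (still assuming a numeric g_group_norm for the sake of the illustration; the posted DDL actually declares it varchar(10)):

```sql
-- One batch of the hypothetical ten-way split.
UPDATE _temp t
   SET t.g_spid = ( SELECT MIN(a.sa_spid)
                      FROM accounts a
                     WHERE a.sa_account   = t.g_uid1
                       AND a.sa_grp       = t.g_group_norm
                       AND a.sa_partnerid = 118 )
 WHERE t.g_spid       IS NULL
   AND t.g_uid1       IS NOT NULL
   AND t.g_group_norm IS NOT NULL
   AND t.g_group_norm >= 0.0 AND t.g_group_norm < 0.1;
```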

Upvotes: 1

ScaisEdge

Reputation: 133360

You could use a composite index on _temp (g_group_norm, g_uid1, g_spid)
and a composite index on accounts (sa_partnerid, sa_account, sa_spid):

UPDATE _temp T
INNER JOIN accounts A  ON  T.g_uid1 = A.sa_account 
  AND T.g_group_norm = A.sa_grp 
    AND A.sa_partnerid = 118
SET T.g_spid = A.sa_spid
WHERE  T.g_spid IS NULL
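As DDL, the two composite indexes suggested above could be created like this (the index names are illustrative):

```sql
-- Illustrative names for the suggested composite indexes.
CREATE INDEX temp_grp_uid_spid
    ON _temp (g_group_norm, g_uid1, g_spid);
CREATE INDEX acct_partner_account_spid
    ON accounts (sa_partnerid, sa_account, sa_spid);
```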

Upvotes: 1
