Reputation: 3095
I have two tables - a temporary table for working with imported data, and the existing accounts table. I need to update the user id in the temp table based on matching by an account number and group number in the accounts table.
CREATE TABLE `_temp` (
`g_id` int(11) NOT NULL AUTO_INCREMENT,
`g_group_norm` varchar(10) DEFAULT NULL,
`g_uid1` varchar(25) DEFAULT NULL,
`g_spid` int(11) DEFAULT NULL,
`g_pid` int(11) DEFAULT NULL,
PRIMARY KEY (`g_id`),
KEY `groupn` (`g_group_norm`) USING BTREE,
KEY `guid` (`g_uid1`) USING BTREE,
KEY `gspid` (`g_spid`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
CREATE TABLE `accounts` (
`sa_actid` int(11) NOT NULL AUTO_INCREMENT,
`sa_grp` varchar(10) DEFAULT NULL,
`sa_account` varchar(25) DEFAULT NULL,
`sa_spid` int(11) DEFAULT NULL,
`sa_partnerid` int(11) DEFAULT NULL,
PRIMARY KEY (`sa_actid`),
KEY `spid` (`sa_spid`) USING BTREE,
KEY `grp` (`sa_grp`) USING BTREE,
KEY `act` (`sa_account`) USING BTREE,
KEY `partnerid` (`sa_partnerid`) USING BTREE,
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
The accounts table is upwards of 5M rows. The _temp table data I'm working with now can be anywhere from 50,000 to 700,000 rows.
The query I've been using to set that g_spid = sa_spid is this:
UPDATE _temp T, accounts A SET
T.g_spid = A.sa_spid
WHERE T.g_uid1 = A.sa_account
AND T.g_group_norm = A.sa_grp
AND A.sa_partnerid = 118
AND T.g_spid IS NULL;
The accounts table has around 3M rows, of which 2.84M are partner ID "118". The temp table update works slow, but OK on 10-20k rows, but when I have a larger set of data to work with (currently temp table is around 100k rows) it seems to never finish (it's been running for 15 minutes now).
Is there a better way to do this query? I've tried it with an inner join on g_uid1 = sa_account, g_group_norm = sa_grp etc. and that seems even slower.
Upvotes: 1
Views: 36
Reputation: 108400
An equivalent result could be achieved:
UPDATE _temp t
SET t.g_spid = ( SELECT MIN(a.sa_spid)
FROM accounts a
WHERE a.sa_account = t.g_uid1
AND a.sa_grp = t.g_group_norm
AND a.sa_partnerid = 118
)
WHERE t.g_spid IS NULL
AND t.g_uid1 IS NOT NULL
AND t.g_group_norm IS NOT NULL
That correlated subquery is going to be executed for every row returned for the outer query, so for performance we want a suitable index, preferably a covering index.
With the equality conditions in the WHERE clause of the correlated subquery, we want those three columns first in the index, with the most selective column first. (With nearly 95% of the rows in accounts having sa_partner_id value of 118, that isn't very selective, so we put that third.)
ON accounts (sa_account, sa_grp, sa_partner_id, sa_spid)
We also include the sa_spid column, to make it a "covering" index, so that subquery can be satisfied entirely from the index, without a need to lookup pages in the underlying table.
(The indexes on the singleton columns might be useful for other queries, but they aren't suitable for this particular query.)
We might also add an index on the _temp
table, if the conditions in the WHERE clause are selective enough. If we're needing to look at more than 10% or 15% of the rows in _temp
, a full scan will likely be faster.
With each row updated, there will overhead for maintaining the index on g_spid
column. For large sets, sometimes its faster to drop the index, perform the update, and the re-add the index.
(I suspect there are better choices for indexes on the _temp
table, but that can't really be determined without knowing the other SQL being executed against the table. None of the indexes on the _temp table look to be suitable for this query, unless the conditions in the WHERE clause of the outer query are very selective.)
For a large number of rows in _temp, we might want to break up the operation into smaller sets.
Using the same exact query pattern, but adding another condition to the outer query to break it into smaller sets.
As an example (I don't know the datatypes of the columns, the ranges of values, or distribution. Just as an illustration of the idea, assuming group_norm
is a DECIMAL value that ranges from 0.00000 to 0.99999 with an even distribution, breaking the UPDATE into ten "sets"...
first run
AND t.group_norm >= 0.0 AND t.group_norm < 0.1
second run
AND t.group_norm >= 0.1 AND t.group_norm < 0.2
third run
AND t.group_norm >= 0.2 AND t.group_norm < 0.3
Upvotes: 1
Reputation: 133360
You could use a composite index on _temp (g_group_norm, g_uid1, g_spid )
and a composite on accounts (sa_partnerid, sa_account, sa_spid )
UPDATE _temp T
INNER JOIN accounts A ON T.g_uid1 = A.sa_account
AND T.g_group_norm = A.sa_grp
AND A.sa_partnerid = 118
SET T.g_spid = A.sa_spid
WHERE T.g_spid IS NULL
Upvotes: 1