Improving SELECT performance in two unioned (UNION ALL) huge tables

Question

I have two huge tables and need the following query be to executed. Both table authors_1 (~20M rows) and authors_2 (~120M rows) have the same structure. I've done my researches and came up with the following query and table structures. It still takes a long time (usually between 10 to 20 seconds) for the query.

Here is the query:

SELECT `fname`, `lname` 
FROM (
    SELECT `fname`, `lname` 
    FROM `authors_1` 
    WHERE 1 AND `lname` LIKE 'AR%' 
    UNION ALL 
    SELECT `fname`, `lname` 
    FROM `authors_2` 
    WHERE 1 AND `lname` LIKE 'AR%') `a` 
GROUP BY CONCAT(`fname`, `lname`) 
ORDER BY `lname` 
LIMIT 0, 999;

and this is the structure, which is similar for both tables (The FT indexes are for other queries).

CREATE TABLE `scipers_authors` (
 `a_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
 `linker` varchar(255) COLLATE utf8_persian_ci NOT NULL,
 `fname` tinytext COLLATE utf8_persian_ci NOT NULL, /*Should this be tinytext because of FT index or I can use VARCHAR(255) while having FT index?*/
 `lname` tinytext COLLATE utf8_persian_ci NOT NULL, /*Same for this one*/
 PRIMARY KEY (`a_id`),
 UNIQUE KEY `linker` (`linker`) USING BTREE,
 KEY `lname_4` (`lname`(4)) USING BTREE,
 KEY `name` (`lname`(128),`fname`(128)) USING BTREE,
 FULLTEXT KEY `fname` (`fname`),
 FULLTEXT KEY `lname` (`lname`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_persian_ci;

This is the EXPLAIN output for the example query shown above (I don't know how to make it beautiful like others, I'm using PMA):

1   PRIMARY   ALL                 476968  Using temporary; Using filesort 
2   DERIVED authors_1   range   lname_4,name,lname  name    386     184800  Using where 
3   UNION   authors_2   range   lname_4,name,lname  name    386     292168  Using where 
    UNION RESULT      ALL                     Using temporary

Any suggestion for improving this query/structure?

Barmar · Accepted Answer

Try changing GROUP BY CONCAT(fname, lname) to GROUP BY fname, lname. Doing all those concatenations is extra work that's unnecessary.

The only semantic difference between these is if there are two people whose first and last names are different, but they're the same when concatenated, e.g. Freda Smith and Fred Asmith. Such combinations are probably unlikely, and I doubt you really wanted them to be combined into a single result in the first place.

And if you're grouping by the same set of columns you're selecting, you can simply use

SELECT DISTINCT fname, lname

You could instead use UNION DISTINCT instead of UNION ALL, since you want duplicates to be removed. Then you don't need to do grouping or DISTINCT in the outer query.

Improving SELECT performance in two unioned (UNION ALL) huge tables

Answers (1)

Related Questions