MLEN
MLEN

Reputation: 2561

Fastest way to select distinct values on multiple columns individually

Lets say I have the following data

+-------+---------+
| col1  | col2    |
+-------+---------+
|  1    | a       |
|  2    | a       |
|  3    | a       |
|  2    | a       |
|  5    | d       |
|  5    | b       |
+-------+---------+

I would like to write a query that returns two columns, column and their unique values. Thus;

+---------+---------+
| column  | value   |
+---------+---------+
|  col1    | 1      |
|  col1    | 2      |
|  col1    | 3      |
|  col1    | 5      |
|  col2    | a      |
|  col2    | b      |
|  col2    | d      |
+----------+--------+

I could achieve this with the following query:

    SELECT 'col1' AS column, DISTINCT(col1) AS value FROM db
UNION ALL
    SELECT 'col2' AS column, DISTINCT(col2) AS value FROM db

It works fine, but in my real DB I have more than 300 million rows and 300+ columns. I do believe all the UNION ALL will slow the process a lot and I'm wondering if there is any other way? The aggregated results will be fetched in R/Python so if a bit of extra manipulation is needed on a much smaller table, that is fine.

Upvotes: 0

Views: 80

Answers (1)

Bill Karwin
Bill Karwin

Reputation: 562791

The query is bound to do a table-scan no matter what you do. It has to examine every row to make sure it has found all distinct values.

A better way to get your values faster is to create a lookup table for each column, that has only one row per distinct value. Querying the lookup table will be the most efficient (you might still have a lot of distinct values in the lookup table, but it will be better than querying your 300 million row table that has duplicate occurrences of values).

By the way, DISTINCT is not a function you can apply to one column. It's a query modifier, it must follow SELECT, before all the select-list. It applies to the whole row, not just one column.

SELECT DISTINCT 'col1' AS `column`, col1 AS value FROM db
UNION ALL
SELECT DISTINCT 'col2', col2 FROM db

column is a reserved keyword, so it must be delimited to use it as an identifier.

The first query of a set of UNIONed queries determines the column aliases. Aliases defined in subsequent queries after the first are ignored.

Upvotes: 1

Related Questions