Reputation: 2561
Lets say I have the following data
+-------+---------+
| col1 | col2 |
+-------+---------+
| 1 | a |
| 2 | a |
| 3 | a |
| 2 | a |
| 5 | d |
| 5 | b |
+-------+---------+
I would like to write a query that returns two columns, column and their unique values. Thus;
+---------+---------+
| column | value |
+---------+---------+
| col1 | 1 |
| col1 | 2 |
| col1 | 3 |
| col1 | 5 |
| col2 | a |
| col2 | b |
| col2 | d |
+----------+--------+
I could achieve this with the following query:
SELECT 'col1' AS column, DISTINCT(col1) AS value FROM db
UNION ALL
SELECT 'col2' AS column, DISTINCT(col2) AS value FROM db
It works fine, but in my real DB I have more than 300 million rows and 300+ columns. I do believe all the UNION ALL
will slow the process a lot and I'm wondering if there is any other way? The aggregated results will be fetched in R/Python
so if a bit of extra manipulation is needed on a much smaller table, that is fine.
Upvotes: 0
Views: 80
Reputation: 562791
The query is bound to do a table-scan no matter what you do. It has to examine every row to make sure it has found all distinct values.
A better way to get your values faster is to create a lookup table for each column, that has only one row per distinct value. Querying the lookup table will be the most efficient (you might still have a lot of distinct values in the lookup table, but it will be better than querying your 300 million row table that has duplicate occurrences of values).
By the way, DISTINCT
is not a function you can apply to one column. It's a query modifier, it must follow SELECT
, before all the select-list. It applies to the whole row, not just one column.
SELECT DISTINCT 'col1' AS `column`, col1 AS value FROM db
UNION ALL
SELECT DISTINCT 'col2', col2 FROM db
column
is a reserved keyword, so it must be delimited to use it as an identifier.
The first query of a set of UNIONed queries determines the column aliases. Aliases defined in subsequent queries after the first are ignored.
Upvotes: 1