PHegde

Reputation: 45

Is there any way to obtain distinct values across all the columns from within a table in BigQuery?

I have tried to get the list of column names for a particular table using SELECT column_name FROM projectname.datasetname.INFORMATION_SCHEMA.COLUMNS WHERE table_name = 'something'.

I am not sure whether the right approach is to loop over all the column names, getting the distinct values for each, or whether BigQuery offers a different methodology.

The expected result is:

Col1 Col2 Col3
1     4    7
2     5    8
3          9
           11

The query must return the distinct values across all the columns irrespective of difference in length of the values.

Upvotes: 0

Views: 1645

Answers (1)

Mikhail Berlyant

Reputation: 173190

I'm not sure how practical your requirement is, but for relatively small tables the query below might work, since it collects every value of each column into a single array. It is written for BigQuery Standard SQL.

#standardSQL
CREATE TEMP FUNCTION DISTINCT_VALUES (arr ANY TYPE) AS (
  ARRAY(SELECT DISTINCT el FROM UNNEST(arr) AS el ORDER BY el)
);
SELECT 
  DISTINCT_VALUES(col1) col1,
  DISTINCT_VALUES(col2) col2,
  DISTINCT_VALUES(col3) col3
FROM (
  SELECT 
    ARRAY_AGG(col1) OVER() col1,
    ARRAY_AGG(col2) OVER() col2,
    ARRAY_AGG(col3) OVER() col3
  FROM `project.dataset.table`
  LIMIT 1
) t   

You can apply it to the sample data from your question, as in the example below:

#standardSQL
CREATE TEMP FUNCTION DISTINCT_VALUES (arr ANY TYPE) AS (
  ARRAY(SELECT DISTINCT el FROM UNNEST(arr) AS el ORDER BY el)
);
WITH `project.dataset.table` AS (
  SELECT 1 col1, 4 col2, 7 col3 UNION ALL
  SELECT 2, 5, 8 UNION ALL
  SELECT 3, 4, 9 UNION ALL
  SELECT 1, 5, 11 
)
SELECT 
  DISTINCT_VALUES(col1) col1,
  DISTINCT_VALUES(col2) col2,
  DISTINCT_VALUES(col3) col3
FROM (
  SELECT 
    ARRAY_AGG(col1) OVER() col1,
    ARRAY_AGG(col2) OVER() col2,
    ARRAY_AGG(col3) OVER() col3
  FROM `project.dataset.table`
  LIMIT 1
) t    

The result is a single row in which each column holds an array of its distinct values:

col1: [1, 2, 3]
col2: [4, 5]
col3: [7, 8, 9, 11]
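A more compact variant of the array query above is sketched here. It is not part of the original approach: it assumes NULL values should be skipped (hence IGNORE NULLS) and relies on ARRAY_AGG accepting DISTINCT together with an ORDER BY on the same expression. The table and column names are the same placeholders used above.

```sql
#standardSQL
-- One pass over the table: aggregate the distinct, non-NULL values
-- of each column into a sorted array.
SELECT
  ARRAY_AGG(DISTINCT col1 IGNORE NULLS ORDER BY col1) AS col1,
  ARRAY_AGG(DISTINCT col2 IGNORE NULLS ORDER BY col2) AS col2,
  ARRAY_AGG(DISTINCT col3 IGNORE NULLS ORDER BY col3) AS col3
FROM `project.dataset.table`
```

This avoids building a full per-row window array and the LIMIT 1 step, but the output is still one row of arrays, so the same size caveat applies.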

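So I think a more reasonable approach is to return one row per distinct value, labeled with the column it came from. For the sample data that gives (row order is not guaranteed):

col   value
col1  1
col1  2
col1  3
col2  4
col2  5
col3  7
col3  8
col3  9
col3  11

This can be achieved with the query below: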

#standardSQL
SELECT DISTINCT 'col1' col, col1 value FROM `project.dataset.table` UNION ALL
SELECT DISTINCT 'col2', col2 FROM `project.dataset.table` UNION ALL
SELECT DISTINCT 'col3', col3 FROM `project.dataset.table` 
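As an alternative sketch, the same long-format result can probably be produced with a single scan using BigQuery's UNPIVOT operator instead of one SELECT per column. This assumes all listed columns share a compatible data type and that NULL values may be dropped, which is UNPIVOT's default behavior. The names col and value mirror the aliases used above.

```sql
#standardSQL
-- UNPIVOT turns col1..col3 into (col, value) pairs;
-- DISTINCT then removes repeated pairs.
SELECT DISTINCT col, value
FROM `project.dataset.table`
UNPIVOT (value FOR col IN (col1, col2, col3))
```

Compared with the UNION ALL form, this reads the table once, at the cost of requiring same-typed columns.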

If the columns have different data types, you can CAST them to STRING, as in the example below:

#standardSQL
SELECT DISTINCT 'col1' col, CAST(col1 AS STRING) value FROM `project.dataset.table` UNION ALL
SELECT DISTINCT 'col2', CAST(col2 AS STRING) FROM `project.dataset.table` UNION ALL
SELECT DISTINCT 'col3', CAST(col3 AS STRING) FROM `project.dataset.table` 

Final note: if there are too many columns to type the queries above by hand, you can easily script them. See the example in https://stackoverflow.com/a/61716652/5221944
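One possible way to script this, shown only as a sketch and not necessarily what the linked answer does, is to build the UNION ALL query from INFORMATION_SCHEMA.COLUMNS and run it with EXECUTE IMMEDIATE. It assumes BigQuery scripting is available, that the placeholders project, dataset and table are replaced with real names, and that every column is a scalar type that can be CAST to STRING (ARRAY and STRUCT columns would need to be filtered out or handled separately).

```sql
-- Holds the generated query text.
DECLARE query STRING;

-- Build one "SELECT DISTINCT ..." per column and join them with UNION ALL.
SET query = (
  SELECT STRING_AGG(
    FORMAT(
      "SELECT DISTINCT '%s' AS col, CAST(`%s` AS STRING) AS value FROM `project.dataset.table`",
      column_name, column_name),
    ' UNION ALL '
    ORDER BY ordinal_position)
  FROM `project.dataset`.INFORMATION_SCHEMA.COLUMNS
  WHERE table_name = 'table'
);

-- Run the generated query.
EXECUTE IMMEDIATE query;
```

Inspecting the generated text with SELECT query before running it is a cheap way to check that it looks as intended.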

Upvotes: 2
