Pranav
Pranav

Reputation: 67

How to select columns of data in BigQuery that has all NULL values

How to select columns of data in BigQuery that has all NULL values

A          B          C 
NULL       1           NULL
NULL       NULL        NULL
NULL       2           NULL
NULL       3           NULL

I want to retrieve columns A and C. Please can you help!!

Upvotes: 3

Views: 7699

Answers (2)

Elliott Brossard
Elliott Brossard

Reputation: 33705

Expanding on my comment on Mikhail's answer, this is what I had in mind. It doesn't require generating a query string, which could be quite long if you have a large number of columns. It compares the count of null values for each column name to the total number of rows in the table to decide if the column should be included in the result.

    #standardSQL
    WITH `project.dataset.table` AS (
      SELECT NULL A,  1 B,  NULL C  UNION ALL
      SELECT NULL,    NULL, NULL    UNION ALL
      SELECT NULL,    2,    NULL    UNION ALL
      SELECT NULL,    3,    NULL 
    )
    SELECT null_column
    FROM `project.dataset.table` AS t,
      UNNEST(REGEXP_EXTRACT_ALL(
        TO_JSON_STRING(t),
        r'\"([a-zA-Z0-9\_]+)\":null')
      ) AS null_column
    GROUP BY null_column
    HAVING COUNT(*) = (SELECT COUNT(*) FROM `project.dataset.table`);

Upvotes: 11

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172974

Below is for BigQuery StandardSQL

Simple option:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT NULL A,  1 B,  NULL C  UNION ALL
  SELECT NULL,    NULL, NULL    UNION ALL
  SELECT NULL,    2,    NULL    UNION ALL
  SELECT NULL,    3,    NULL 
)
SELECT COUNT(A) A, COUNT(B) B, COUNT(C) C
FROM `project.dataset.table`

it returns below where 0(zero) indicates that respective column has all NULLs

A   B   C    
0   3   0    

If this is "not enough" - below is more "sophisticated" version:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT NULL A,  1 B,  NULL C  UNION ALL
  SELECT NULL,    NULL, NULL    UNION ALL
  SELECT NULL,    2,    NULL    UNION ALL
  SELECT NULL,    3,    NULL 
)
SELECT SPLIT(y, ':')[OFFSET(0)] column
FROM (
  SELECT REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}"]', '') x
  FROM (
    SELECT COUNT(A) A, COUNT(B) B, COUNT(C) C
    FROM `project.dataset.table`
  ) t
), UNNEST(SPLIT(x)) y
WHERE CAST(SPLIT(y, ':')[OFFSET(1)] AS INT64) = 0 

it returns result as below - enlisting only columns with all NULLs

column   
A    
C    

Note: for your real table - just remove WITH block and replace project.dataset.table with your real table reference

Also, of course, use real column names

My table has round 700 columns..

Below is an example of how you can easily generate above query for any number of columns.

1. Just run below
2. Copy result - this is a generated query
3. paste generated query into new UI and run it
4. Enjoy (I hope you will) result :o)

Of course, as usually replace project.dataset.table with your real table reference

#standardSQL
SELECT 
  CONCAT('''
SELECT SPLIT(y, ':')[OFFSET(0)] column
FROM (
  SELECT REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}"]', '') x
  FROM (  
    SELECT ''', y, 
  '''
    FROM `project.dataset.table`
  ) t
), UNNEST(SPLIT(x)) y
WHERE CAST(SPLIT(y, ':')[OFFSET(1)] AS INT64) = 0  
  '''
  )
FROM (
  SELECT 
    STRING_AGG(CONCAT('COUNT(', x, ') ', x), ', ') y
  FROM (
    SELECT REGEXP_EXTRACT_ALL(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}]', ''), r'"([\w_]+)":') x
    FROM `project.dataset.table` t
    LIMIT 1
  ), UNNEST(x) x
)

Note: please pay attention to query cost - both "generation query" and final query itself will do full scan

You can generate columns list much cheaper off of table schema in any client of your choice

To test / play with it - you can use same dummy data as for initial queries in my answer

Upvotes: 2

Related Questions