Reputation: 211
I realize there are a million ways to get a schema from a dataset.table in Google BigQuery....
Is there a way to get schema data via a SELECT statement, similar to querying SQL Server's INFORMATION_SCHEMA tables?
Thanks.
Upvotes: 4
Views: 4722
Reputation: 33705
Mikhail's answer is still relevant if the goal is to compute information such as the number of null and non-null values per column. To answer the original question, though: BigQuery provides support for INFORMATION_SCHEMA views, which are in beta at the time of this writing. If you want to get the schema of a table, you can query the COLUMNS view, e.g.:
SELECT column_name, data_type
FROM `fh-bigquery`.reddit.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'subreddits'
ORDER BY ordinal_position
This returns:
Row  column_name   data_type
---  -----------   ---------
1    subr          STRING
2    created_utc   TIMESTAMP
3    score         INT64
4    num_comments  INT64
5    c_posts       INT64
6    ups           INT64
7    downs         INT64
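If the table has nested RECORD or REPEATED columns, the COLUMNS view only shows the top-level column; the COLUMN_FIELD_PATHS view lists each leaf field instead. A minimal sketch against the same dataset (subreddits has no nested fields, so here it returns the same seven rows):

SELECT field_path, data_type
FROM `fh-bigquery`.reddit.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name = 'subreddits'
ORDER BY field_path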
Upvotes: 1
Reputation: 172944
I need to perform data profiling, and the only tool I have is the QUERY function in the web UI. I want to create a query that counts nulls, non-nulls, string lengths, and such per column
Below is a potential direction/idea for you to explore and enhance to your needs.
It works relatively well for simple schemas; it looks like it needs to be tuned for schemas with RECORD and REPEATED fields.
Also, note that it skips columns that are NULL in all rows of the table, so such columns are not visible to the approach below.
So, with `fh-bigquery.reddit.subreddits` as a simple test table:
#standardSQL
WITH `table` AS (
SELECT * FROM `fh-bigquery.reddit.subreddits`
),
table_as_json AS (
SELECT REGEXP_REPLACE(TO_JSON_STRING(t), r'^{|}$', '') AS row
FROM `table` AS t
),
pairs AS (
SELECT
REPLACE(column_name, '"', '') AS column_name,
IF(SAFE_CAST(column_value AS STRING)='null',NULL,column_value) AS column_value
FROM table_as_json, UNNEST(SPLIT(row, ',"')) AS z,
UNNEST([SPLIT(z, ':')[SAFE_OFFSET(0)]]) AS column_name,
UNNEST([SPLIT(z, ':')[SAFE_OFFSET(1)]]) AS column_value
)
SELECT
column_name,
COUNT(DISTINCT column_value) AS _distinct_values,
COUNTIF(column_value IS NULL) AS _nulls,
COUNTIF(column_value IS NOT NULL) AS _non_nulls,
MIN(LENGTH(SAFE_CAST(column_value AS STRING))) AS _min_length,
MAX(LENGTH(SAFE_CAST(column_value AS STRING))) AS _max_length,
ROUND(AVG(LENGTH(SAFE_CAST(column_value AS STRING)))) AS _avr_length
FROM pairs
WHERE column_name <> ''
GROUP BY column_name
ORDER BY column_name
Result is
column_name _nulls _non_nulls _min_length _max_length _avr_length
----------- ------ ---------- ----------- ----------- -----------
c_posts 0 2499 1 4 4.0
created_utc 0 2499 14 14 14.0
downs 0 2499 1 8 5.0
num_comments 0 2499 1 7 5.0
score 0 2499 1 7 5.0
subr 0 2499 4 23 12.0
ups 0 2499 1 8 5.0
I think this is very close to what is called profiling (and within the scope of what is available to you).
You can easily add further column metrics, etc.
I really think this can be a good starting point for you.
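For example, one additional metric could be bolted onto the final SELECT of the query above, reusing the same pairs CTE. The _uniqueness_ratio name below is just a hypothetical illustration (SAFE_DIVIDE guards against columns with zero non-null values):

SELECT
  column_name,
  COUNT(DISTINCT column_value) AS _distinct_values,
  -- share of non-null values that are distinct; 1.0 suggests a key-like column
  ROUND(SAFE_DIVIDE(COUNT(DISTINCT column_value),
                    COUNTIF(column_value IS NOT NULL)), 2) AS _uniqueness_ratio
FROM pairs
WHERE column_name <> ''
GROUP BY column_name
ORDER BY column_name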
Upvotes: 8