Mallik Tiru
Mallik Tiru

Reputation: 11

data profiling on bigquery table covering min,max,unique, null count statistics

I am looking for solution to perform data profiling on bigquery table covering below statistics for each column in table. Some of the columns are ARRAY and STRUCT as given below.

I tried multiple ways to generate dynamic query to cover below scenarios but no luck.. I will greatly appreciate your help/inputs.

Metrics I want to calculate part of this solution are:

Sample Table Data:

Blockquote

Desired output enter image description here

Upvotes: 0

Views: 3249

Answers (3)

Mona Rakibe
Mona Rakibe

Reputation: 1

I highly recommend using a low-code no-code tool like Telmai for profiling entire tables in BigQuery within minutes.

All of the metrics you have mentioned are OOTB, so dont need to write any code and you and there is free tier so almost no cost to you.

https://telm-ai.webflow.io/data-profiling-and-observability

Upvotes: 0

Sabri Karagönen
Sabri Karagönen

Reputation: 2365

This query returns all the columns from a table in a dataset. I excluded STRUCTS, since you only need value columns.

SELECT CONCAT('`', table_catalog, '.', table_schema, '.', table_name, '`') as table_name, field_path, data_type
FROM project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name = 'table_name'
  AND data_type NOT LIKE 'STRUCT%'

Using the columns table, we'll generate a SQL query to get all these columns. Here, I only added MIN, MAX and COUNT DISTINCT columns. However, you can add more of them by adding new lines to SELECT part.

SELECT 
  STRING_AGG(
    CONCAT(
      'SELECT "', field_path, '" as field_path, ',
        'CAST(MIN(', field_path, ') as string) as max, ',
        'CAST(MAX(', field_path, ') as string) as min ',
        'COUNT(DISTINCT ', field_path, ') as count_distinct ',
      'FROM ', table_name) ,
    ' UNION ALL \n'
  ) as query
FROM columns

At the end, we'll run this query using EXECUTE IMMEDIATE, since it's a string:

EXECUTE IMMEDIATE (
  query
)

To bring all these queries together, it looks like that:

EXECUTE IMMEDIATE (
  SELECT 
    STRING_AGG(
      CONCAT(
        'SELECT "', field_path, '" as field_path, ',
          'CAST(MIN(', field_path, ') as string) as max, ',
          'CAST(MAX(', field_path, ') as string) as min ',
          'COUNT(DISTINCT ', field_path, ') as count_distinct ',
        'FROM ', table_name) ,
      ' UNION ALL \n'
    ) as query
  FROM (
    SELECT CONCAT('`', table_catalog, '.', table_schema, '.', table_name, '`') as table_name, field_path, data_type
    FROM project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
    WHERE table_name = 'table_name'
      AND data_type NOT LIKE 'STRUCT%'
  )
)

PS: It only solves structs for now. Can you show me an example of your ARRAY columns?

Upvotes: 2

rmesteves
rmesteves

Reputation: 4085

I dont understand what you mean with Min Length and Max Length, but considering the provided data, you could do something like below.

This query have basically two steps:

  1. Create a temporary table with flat data using the WITH clause
  2. Calculate the metrics by running one query for each column and use UNION ALL to compose everything in a single table.

Query:

WITH
  t AS(
  SELECT
    first_name,
    dob,
    last_name,
    a.zip addresses_zip,
    a.state addresses_state,
    a.city addresses_city,
    a.numberOfYears addresses_numberOfYears,
    a.status addresses_status,
    a.phone.primarynumber addresses_phone_primarynumber,
    a.phone.secondary addresses_phone_secondary
  FROM
    <your-table> t,
    t.addresses a 
)

SELECT
  "first_name" AS column,
  COUNT(first_name) total_count,
  COUNT(DISTINCT first_name) total_distinct,
  SUM(
  IF
    (first_name IS NULL,
      1,
      0)) total_null,
  CAST(MIN(first_name) AS string) min_value,
  CAST(MAX(first_name) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "dob" AS column,
  COUNT(dob) total_count,
  COUNT(DISTINCT dob) total_distinct,
  SUM(
  IF
    (dob IS NULL,
      1,
      0)) total_null,
  CAST(MIN(dob) AS string) min_value,
  CAST(MAX(dob) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "last_name" AS column,
  COUNT(last_name) total_count,
  COUNT(DISTINCT last_name) total_distinct,
  SUM(
  IF
    (last_name IS NULL,
      1,
      0)) total_null,
  CAST(MIN(last_name) AS string) min_value,
  CAST(MAX(last_name) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.zip" AS column,
  COUNT(addresses_zip) total_count,
  COUNT(DISTINCT addresses_zip) total_distinct,
  SUM(
  IF
    (addresses_zip IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_zip) AS string) min_value,
  CAST(MAX(addresses_zip) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.state" AS column,
  COUNT(addresses_state) total_count,
  COUNT(DISTINCT addresses_state) total_distinct,
  SUM(
  IF
    (addresses_state IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_state) AS string) min_value,
  CAST(MAX(addresses_state) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.city" AS column,
  COUNT(addresses_city) total_count,
  COUNT(DISTINCT addresses_city) total_distinct,
  SUM(
  IF
    (addresses_city IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_city) AS string) min_value,
  CAST(MAX(addresses_city) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.numberOfYears" AS column,
  COUNT(addresses_numberOfYears) total_count,
  COUNT(DISTINCT addresses_numberOfYears) total_distinct,
  SUM(
  IF
    (addresses_numberOfYears IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_numberOfYears) AS string) min_value,
  CAST(MAX(addresses_numberOfYears) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.status" AS column,
  COUNT(addresses_status) total_count,
  COUNT(DISTINCT addresses_status) total_distinct,
  SUM(
  IF
    (addresses_status IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_status) AS string) min_value,
  CAST(MAX(addresses_status) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.phone.primarynumber" AS column,
  COUNT(addresses_phone_primarynumber) total_count,
  COUNT(DISTINCT addresses_phone_primarynumber) total_distinct,
  SUM(
  IF
    (addresses_phone_primarynumber IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_phone_primarynumber) AS string) min_value,
  CAST(MAX(addresses_phone_primarynumber) AS string) max_value
FROM
  t 

UNION ALL

SELECT
  "addresses.phone.secondary" AS column,
  COUNT(addresses_phone_secondary) total_count,
  COUNT(DISTINCT addresses_phone_secondary) total_distinct,
  SUM(
  IF
    (addresses_phone_secondary IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_phone_secondary) AS string) min_value,
  CAST(MAX(addresses_phone_secondary) AS string) max_value
FROM
  t

Upvotes: 0

Related Questions