arcee123

Reputation: 211

How to convert columns to rows in Google BigQuery

I am performing data QA testing.

I have this query to establish any errors between the source table and the destination table.

select
  count(case when coalesce(x.col1, 1) = coalesce(y.col1, 1) then null else 1 end) as cnt_col1,
  count(case when coalesce(x.col2, "1") = coalesce(y.col2, "1") then null else 1 end) as cnt_col2
from
  `DatasetA.Table` x
full outer join
  `DatasetB.Table` y
on x.col1 = y.col1

The per-row output of the CASE expressions looks like this (the COUNTs then aggregate it into a single row):

col1, col2
null, null
null, null
1, null
null, 1

I have 200 tables that I need to perform this test on, and the number of columns varies: the table above has only two, but some have 50.

I already have the queries for the tables, but I need to conform the output of all of the tests into a single result set. My plan is to conform each query to a unified output and combine them with UNION ALL (see the sketch below).

The output set should say:

COLUMN, COUNT_OF_ERRORS
cnt_col1, 1
cnt_col2, 1
...
cnt_col15, 0
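
To illustrate the combining step, here is a mocked-up sketch of the UNION ALL, with each table's reverse-pivoted test stubbed out as a CTE (names and numbers are illustrative):

#standardSQL
-- Hedged sketch of the UNION ALL plan; the per-table test results are
-- mocked as CTEs and a table_name column tells the sources apart.
WITH table1_errors AS (
  SELECT 'cnt_col1' AS column_name, 1 AS count_of_errors UNION ALL
  SELECT 'cnt_col2', 1
),
table2_errors AS (
  SELECT 'cnt_col1' AS column_name, 0 AS count_of_errors
)
SELECT 'Table1' AS table_name, column_name, count_of_errors FROM table1_errors
UNION ALL
SELECT 'Table2', column_name, count_of_errors FROM table2_errors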

My question is this: how do I reverse-pivot this so I can achieve the output I'm looking for?

Thanks

Upvotes: 1

Views: 9807

Answers (2)

Sourygna

Reputation: 719

About "comparing 2 tables" in Big Data, I don't think that doing some Joins is the best approach, because Joins are quite slow in general and then you have to handle the case of "outer" joins rows.

I worked on this topic years ago (https://community.hortonworks.com/articles/1283/hive-script-to-validate-tables-compare-one-with-an.html) and I am now trying to backport this knowledge to compare Hive tables with BigQuery tables.

One of my main ideas is to use checksums to be sure that one table is fully identical to the other. Here is a basic example:

with one_string as (
  -- Flatten each row into one delimited string.
  select concat(
    sessionid, '|', referrercode, '|', purchaseid, '|', customerid, '|',
    cast(bouncerateind as string), '|', cast(productpagevisit as string), '|',
    cast(itemordervalue as string), '|', cast(purchaseinsession as string), '|',
    cast(hit_time_gmt as string), '|', datedir, '|', productcategory, '|',
    post_cookies) as bigstring
  from bidwh2.omniture_2017_03_24_v2
),
shas as (
  -- One SHA1 per row.
  select TO_BASE64(SHA1(bigstring)) as sha from one_string
),
shas_prefix as (
  -- Group the rows by the first character of their hash.
  select substr(sha, 1, 1) as prefix, sha from shas
),
results_prefix as (
  -- One checksum per prefix group; the ORDER BY inside STRING_AGG makes the
  -- aggregation order deterministic (an ORDER BY in an earlier CTE is not
  -- guaranteed to survive aggregation).
  select concat(prefix, ' ', TO_BASE64(SHA1(STRING_AGG(sha, '|' order by sha)))) as res
  from shas_prefix
  group by prefix
)
-- Reduce the per-prefix checksums to one checksum for the whole table.
select TO_BASE64(SHA1(STRING_AGG(res, '|' order by res))) as sha
from results_prefix;

So you run that on each of the two tables and compare the two checksums.
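
For the question's two tables, a pared-down version of the same pattern could look like this (the two-column schema, the IFNULL defaults, and the single STRING_AGG are assumptions of the sketch):

#standardSQL
-- Hedged sketch: one checksum per table, compared in one query. The table
-- names come from the question; the two-column schema is an assumption.
WITH sha_a AS (
  SELECT TO_BASE64(SHA1(STRING_AGG(row_sha, '|' ORDER BY row_sha))) AS sha
  FROM (
    SELECT TO_BASE64(SHA1(CONCAT(IFNULL(CAST(col1 AS STRING), ''), '|',
                                 IFNULL(col2, '')))) AS row_sha
    FROM `DatasetA.Table`
  ) AS per_row_a
),
sha_b AS (
  SELECT TO_BASE64(SHA1(STRING_AGG(row_sha, '|' ORDER BY row_sha))) AS sha
  FROM (
    SELECT TO_BASE64(SHA1(CONCAT(IFNULL(CAST(col1 AS STRING), ''), '|',
                                 IFNULL(col2, '')))) AS row_sha
    FROM `DatasetB.Table`
  ) AS per_row_b
)
SELECT a.sha = b.sha AS tables_match
FROM sha_a a CROSS JOIN sha_b b

This collapses the per-prefix stage into a single STRING_AGG, which is fine for modest row counts; on large tables the prefix grouping in the full example keeps the intermediate strings manageable.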

The final idea is to have a Python script (not finished yet; I hope my company allows me to open-source it when finished) that would do the following:

  1. Count the rows in some "buckets" (groups of rows whose well-distributed column has the same checksum modulo a big number) and compare the results, because there is no need to checksum the whole table if the row counts don't match (see the sketch after this list).
  2. Visually show the differences if the counts do not match.
  3. Use the bucket/rows technique plus some other "buckets/columns" to compute checksums in a similar way to the example above, and compare all those checksums together.
  4. Visually show the differences if the checksums do not match.
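
A minimal sketch of step 1, assuming customerid is the well-distributed column and using FARM_FINGERPRINT as the checksum (both are assumptions; 16 buckets is illustrative):

#standardSQL
-- Hedged sketch: bucket rows by a hash of an assumed well-distributed
-- column and count per bucket; run on both tables and diff the counts.
SELECT
  MOD(ABS(FARM_FINGERPRINT(customerid)), 16) AS bucket,
  COUNT(*) AS row_count
FROM bidwh2.omniture_2017_03_24_v2
GROUP BY bucket
ORDER BY bucket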

Edit on 03/11/2017: the script is finished and can be found at: https://github.com/bolcom/hive_compared_bq

Upvotes: 1

Mikhail Berlyant

Reputation: 172993

How do I reverse-pivot this so I can achieve the output I'm looking for?

Assuming you have table `data`

col1    col2    col3
----    ----    ----     
null    null    null     
null    null    1    
null    1       1    
1       null    1    
1       null    1    
1       null    1   

And you need to reverse-pivot it to

column      count_of_errors  
--------    ---------------
cnt_col1    3    
cnt_col2    1    
cnt_col3    5    

Below is for BigQuery Standard SQL and does exactly this

#standardSQL
WITH `data` AS (
  SELECT NULL AS col1, NULL AS col2, NULL AS col3 UNION ALL
  SELECT NULL, NULL, 1 UNION ALL
  SELECT 1, NULL, 1 UNION ALL
  SELECT NULL, 1, 1 UNION ALL
  SELECT 1, NULL, 1 UNION ALL
  SELECT 1, NULL, 1
)
SELECT r.* FROM (
  SELECT 
    [
    STRUCT<column STRING, count_of_errors INT64>
      ('cnt_col1', SUM(col1)),
      ('cnt_col2', SUM(col2)),
      ('cnt_col3', SUM(col3))
    ] AS row
  FROM `data`
), UNNEST(row) AS r   

It is simple enough and friendly to adjust to any number of columns you potentially have in your initial `data` table - you just need to add the respective number of ('cnt_colN', SUM(colN)), entries - which can be done manually, or you can write a simple script to generate those lines (or the whole query).
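
For the script route, one hedged option is to let BigQuery generate the lines itself (this assumes INFORMATION_SCHEMA.COLUMNS is available in your project; the dataset and table names are illustrative, taken from the question):

#standardSQL
-- Hedged sketch: emit one ('cnt_colN', SUM(colN)), line per column.
-- DatasetA.Table is an illustrative name from the question.
SELECT CONCAT("('cnt_", column_name, "', SUM(", column_name, ")),") AS struct_line
FROM DatasetA.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'Table'
ORDER BY ordinal_position

Paste the generated lines between the STRUCT brackets, dropping the trailing comma on the last one.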

Upvotes: 8
