Reputation: 211
I am performing data QA testing.
I have this query to establish any errors between the source table and the destination table.
select
  count(case when coalesce(x.col1, 1) = coalesce(y.col1, 1) then null else 1 end) as cnt_col1,
  count(case when coalesce(x.col2, '1') = coalesce(y.col2, '1') then null else 1 end) as cnt_col2
from
  `DatasetA.Table` x
full outer join
  `DatasetB.Table` y
on x.col1 = y.col1
The output of this query is like this:
col1, col2
null, null
null, null
1, null
null, 1
I have 200 tables that I need to perform this test on, and the number of columns is dynamic; the table above has only two columns, while some have 50.
I already have the queries for the tables, but I need to conform the output of all of the tests into a single result set. My plan is to conform each query to a unified output and combine them with a UNION ALL.
The output set should say:
COLUMN, COUNT_OF_ERRORS
cnt_col1, 1
cnt_col2, 1
...
cnt_col15, 0
My question is this: how do I reverse pivot this so I can achieve the output I'm looking for?
Thanks
Upvotes: 1
Views: 9807
Reputation: 719
About "comparing 2 tables" in Big Data, I don't think joins are the best approach: joins are quite slow in general, and you then have to handle the rows produced by the "outer" part of the join.
I worked on this topic years ago (https://community.hortonworks.com/articles/1283/hive-script-to-validate-tables-compare-one-with-an.html) and I am now trying to backport this knowledge to compare Hive tables with BigQuery tables.
One of my main ideas is to use checksums to be sure that one table is fully identical to the other. Here is a "basic example":
with one_string as (
  select concat(
    sessionid, '|', referrercode, '|', purchaseid, '|', customerid, '|',
    cast(bouncerateind as string), '|', cast(productpagevisit as string), '|',
    cast(itemordervalue as string), '|', cast(purchaseinsession as string), '|',
    cast(hit_time_gmt as string), '|', datedir, '|', productcategory, '|',
    post_cookies
  ) as bigstring
  from bidwh2.omniture_2017_03_24_v2
),
shas as (
  select TO_BASE64(sha1(bigstring)) as sha from one_string
),
shas_prefix as (
  select substr(sha, 0, 1) as prefix, sha from shas
),
shas_ordered as (
  select prefix, sha from shas_prefix order by sha
),
results_prefix as (
  select concat(prefix, ' ', TO_BASE64(sha1(STRING_AGG(sha, '|')))) as res
  from shas_ordered
  group by prefix
),
results_ordered as (
  select 1 as myall, res from results_prefix order by res
)
select SHA1(STRING_AGG(res, '|')) as sha from results_ordered group by myall;
Run that on each of the 2 tables and compare the two checksums.
The final idea is to have a Python script (not finished yet; I hope my company allows me to open-source it when finished) that would do the following:
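To make the chained hashing in the SQL above easier to follow, here is a rough sketch of the same idea in plain Python (the table rows are made up for illustration): one SHA-1 per serialized row, row hashes grouped by their first character, one SHA-1 per group, and one final SHA-1 over the sorted group hashes, so the result is independent of row order.

```python
import hashlib
from collections import defaultdict

def table_checksum(rows):
    """rows: iterable of tuples; returns one hex digest for the whole table."""
    # 1. One '|'-delimited string per row, then one SHA-1 per row; sorting
    #    makes the final checksum independent of row order.
    shas = sorted(
        hashlib.sha1("|".join(str(c) for c in row).encode()).hexdigest()
        for row in rows
    )
    # 2. Group the row hashes by their first character (the "prefix").
    groups = defaultdict(list)
    for sha in shas:
        groups[sha[0]].append(sha)
    # 3. One SHA-1 per group, then one final SHA-1 over all group hashes.
    group_shas = sorted(
        hashlib.sha1("|".join(group).encode()).hexdigest()
        for _, group in sorted(groups.items())
    )
    return hashlib.sha1("|".join(group_shas).encode()).hexdigest()

source = [(1, "a"), (2, "b")]
destination = [(2, "b"), (1, "a")]  # same rows, different order
print(table_checksum(source) == table_checksum(destination))  # True
```

The intermediate per-prefix hashes are what make this practical on big tables: if the final checksums differ, you can compare the prefix-level results to narrow down which subset of rows diverges.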
Edit on 03/11/2017: script is finished and can be found at: https://github.com/bolcom/hive_compared_bq
Upvotes: 1
Reputation: 172993
How do I reverse pivot this so I can achieve the output I'm looking for?
Assuming you have table `data`
col1 col2 col3
---- ---- ----
null null null
null null 1
null 1 1
1 null 1
1 null 1
1 null 1
And you need to reverse pivot it to
column count_of_errors
-------- ---------------
cnt_col1 3
cnt_col2 1
cnt_col3 5
Below is for BigQuery Standard SQL and does exactly this
#standardSQL
WITH `data` AS (
SELECT NULL AS col1, NULL AS col2, NULL AS col3 UNION ALL
SELECT NULL, NULL, 1 UNION ALL
SELECT 1, NULL, 1 UNION ALL
SELECT NULL, 1, 1 UNION ALL
SELECT 1, NULL, 1 UNION ALL
SELECT 1, NULL, 1
)
SELECT r.* FROM (
SELECT
[
STRUCT<column STRING, count_of_errors INT64>
('cnt_col1', SUM(col1)),
('cnt_col2', SUM(col2)),
('cnt_col3', SUM(col3))
] AS row
FROM `data`
), UNNEST(row) AS r
It is simple enough and easy to adjust to any number of columns in your initial `data` table: just add the respective ('cnt_colN', SUM(colN)) entries, either manually or with a simple script that generates those lines (or the whole query).
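As a sketch of what such a generator script could look like (the function name and column list are made up; plug in each table's real column names, e.g. from INFORMATION_SCHEMA):

```python
def build_reverse_pivot_query(table, columns):
    """Emit the reverse-pivot query above for an arbitrary column list."""
    # One ('cnt_colN', SUM(colN)) STRUCT entry per column.
    structs = ",\n    ".join(f"('cnt_{c}', SUM({c}))" for c in columns)
    return (
        "#standardSQL\n"
        "SELECT r.* FROM (\n"
        "  SELECT\n"
        "  [\n"
        "    STRUCT<column STRING, count_of_errors INT64>\n"
        f"    {structs}\n"
        "  ] AS row\n"
        f"  FROM `{table}`\n"
        "), UNNEST(row) AS r"
    )

print(build_reverse_pivot_query("DatasetA.Table", ["col1", "col2", "col3"]))
```

Generating one such query per table and joining them with UNION ALL then gives the single COLUMN, COUNT_OF_ERRORS result set the question asks for.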
Upvotes: 8