Reputation: 21
I would like to automate this processes for multiple datasets. I am not sure if there exists a tool like Alteryx or any software where I can do the following.
Here is the data that exists in both SIT and PROD:
ID | Block | Amt A | Amt B | Amt C | .... |
---|---|---|---|---|---|
1 | X | 1000 | 400 | 1000 | .... |
2 | Y | 2000 | 500 | 4000 | .... |
3 | Z | 3000 | 500 | 3000 | .... |
2 | Y | 2000 | 3000 | 2000 | .... |
3 | Z | 3000 | 4500 | 3000 | .... |
Block | Counts | Sum Amt A | Sum Amt B |
---|---|---|---|
X | 1000 | 40000 | 10000 |
Y | 2000 | 45000 | 20000 |
Z | 3000 | 50000 | 30000 |
Block | Counts | Sum Amt A | Sum Amt B |
---|---|---|---|
X | 1000 | 40000 | 15000 |
Y | 2000 | 47000 | 20000 |
Z | 3000 | 50000 | 30000 |
FYI - There are multiple columns like block that could be used on aggregation. One could choose any column, get an aggregation to find inconsistencies and then explore into that particular block.
Upvotes: 0
Views: 31
Reputation: 55
It's a typical full join, then filter process, you can do it in SQL, Python dataframe or Alteryx, pretty much anything
`
WITH
sum_prod AS
(
SELECT
SUM(AMT_A) SUM_AMT_A,
COUNT(1),..
FROM prod
GROUP BY ...
)
,
sum_SIT AS
(
SELECT
SUM(AMT_A) SUM_AMT_A,
COUNT(1),..
FROM sit
GROUP BY ...
)
SELECT
*
FROM sum_prod
FULL JOIN
sum_sit
ON
sum_prod.block = sum_sit.block
WHERE
sum_prod.Sum_Amt_A <> sum_sit.Sum_Amt_A
OR sum_prod.Sum_Amt_A IS NULL
OR sum_sit.Sum_Amt_A IS NULL -- always take care of one side missing
--add however_your_condition_to_tell_diff
`
Same principle in Alteryx, just a join and a selection filter will do
Upvotes: 1