Reputation: 7957
I need to compare data between 2 external tables (A and A') with the same columns and i actually make some sql scripts to do this but is not enough efficient and robust. there are a way to do this with UDAF or external tools?
Here one sample:
Table A
+----+----+----+
| C1 | C2 | C3 |
+----+----+----+
| 01 | A | 02 |
| 02 | B | 03 |
| 03 | C | 04 |
+----+----+----+
Table A'
+----+----+----+
| C1 | C2 | C3 |
+----+----+----+
| 01 | B | 02 |
| 02 | B | 03 |
| 04 | D | 04 |
+----+----+----+
Output
+------+------+------+------+------+------+
| C1 | C2 | C3 | C1' | C2' | C3' |
+------+------+------+------+------+------+
| 01 | A | 02 | 01 | B | 02 |
| 03 | C | 04 | null | null | null |
| null | null | null | 04 | D | 04 |
+------+------+------+------+------+------+
Upvotes: 0
Views: 2421
Reputation: 445
There is a new tool to compare data. DiffTool (https://dataq.io/) is a UI based data comparison tool that can be used across data sources like RDBMS, Hadoop(Hive) or text files (CSV & JSON). DiffTool Compares datasets based on key(s), and some of the features include:
controlling data volume using custom filters, Transforming columns using SQL expressions, The ability to scale using distributed architecture, And analyzing results intuitively with rich visualization
Upvotes: 1
Reputation: 1810
You could do this as a UDF if the comparison id b/w rows from table A and table A' based on a join condition which in this case i believe is the first column C1.
Approach:
select UDF_FUNCTION(a.c1,a.c2.a.c3, A'.c2,a'.c3) from A JOIN A' ON (A.c1=a'.c1)
In the UDF you can compare the values as per your requirement and return an array or delimited text.
Also what do you mean by
some sql scripts to do this but is not enough efficient and robust
Upvotes: 0
Reputation: 618
You have to write the custom MR Program for this. We have already had these kind of problems. You have to write a custom MR Program are custom UDF. There is no in build method to solve this problem.
Upvotes: 0