Parm

Reputation: 568

How do I write a program that gives a diff of two tables?

Scenario

I have two data pipelines; say a production pipeline and a development pipeline. I need to verify that the tables produced at the end of each pipeline are the same. These tables may be in different databases, in different data centers, and may each contain anywhere from a few hundred to a billion rows.

Task

I need to report all rows that each table is missing and all rows that are mismatched; for mismatches, I need to show the specific row from each table. For example:

table1:
Name | Age |
Alice| 25  |
Bob  | 49  |
Jim  | 45  |
Cal  | 52  |

table2:
Name | Age |
Bob  | 49  |
Cal  | 42  |
Alice| 35  |

Should output something similar to:

table1 missing rows (empty):
Name | Age |
     |     |

table2 missing rows:
Name | Age |
Jim  | 45  |

mismatching rows:
Name | Age | table |
Alice| 25  | table1|
Alice| 35  | table2|
Cal  | 52  | table1|
Cal  | 42  | table2|

Bob is ignored since that row matches.

If you'd like to take a crack at this, feel free to skip to the Question section below.

What I've tried

I tried using Spark. Long story short, Spark isn't meant for connecting to and querying two different databases. More specifically, I can't create multiple SparkSessions, and I don't want to use Spark in a way it was not intended to be used.

I originally wanted to use Spark because I wanted to use the "except" function. This way I could subtract table2 from table1 and vice versa. These two results would contain all of my missing rows and mismatched rows, and I'd be able to sort and label them. However, I've found it exceedingly difficult to connect to two different databases with Spark; for example, connecting to a Hive database and a Postgres database from one program is troublesome.

Current Ideas

  1. I can continue with the idea of using Spark's "except" logic, but omit Spark and import each table into the same database, then use the "except" keyword built into SQL. Once I have a pair of result sets that describe the discrepancies between table1 and table2 as well as between table2 and table1, I can sort each of them and compare row by row to find which rows should be considered "missing" and which should be considered "mismatching" (a rough sketch of this is below the list).

  2. I can query all rows from each table, sort them, and then do a line-by-line comparison. I'm not sure how this scales compared to the above idea, but the sort operation is quite costly for some of the larger tables.
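
A rough sketch of idea 1, assuming both tables have already been copied into the same database and use the example's (name, age) schema; "except" is ANSI SQL, though some engines spell it "minus":

/* rows in table1 with no exact counterpart in table2
   (covers rows that table2 is missing plus table1's side of mismatches) */
select name, age from table1
except
select name, age from table2;

/* rows in table2 with no exact counterpart in table1 */
select name, age from table2
except
select name, age from table1;

Given a primary key, a key that appears in only one of the two result sets is a missing row, and a key that appears in both result sets is a mismatching row.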

Note: I plan on achieving scalability and elasticity by containerizing the program and providing it with different compute resources depending on the size of the job.

Question

What algorithm should I use to take an input and give an output like the example listed in the Task section?

I'm not asking which specific technologies to use; however, suggestions are welcome (perhaps place them in the comments section of your answer). I'm asking for a high-level algorithm to achieve the input and output detailed in Task.

Edit:

You can assume there is a primary key for each table. However, the non-key columns still need to be checked for equality.

Upvotes: 1

Views: 144

Answers (1)

James

Reputation: 3015

Can you try something like this?

/* table1 missing rows: keys present in table2 but not in table1 */
select t2.name, t2.age
from table1 t1
  right join table2 t2
    on t1.name = t2.name
where t1.name is null;

/* table2 missing rows: keys present in table1 but not in table2 */
select t1.name, t1.age
from table1 t1
  left join table2 t2
    on t1.name = t2.name
where t2.name is null;

/* mismatching rows: keys present in both tables whose values differ */
select
  coalesce(t1.name, t2.name) as name,
  coalesce(t1.age, t2.age) as age,
  case when t1.name is not null then 'table1' else 'table2' end as tbl
from table1 t1
  full join table2 t2
      on t1.name = t2.name
     and t1.age = t2.age
where (t1.age is null or t2.age is null)
  /* keep only keys that exist in both tables; keys missing entirely
     from one table are already reported by the two queries above */
  and coalesce(t1.name, t2.name) in (select name from table1)
  and coalesce(t1.name, t2.name) in (select name from table2);
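
The full join matches on both name and age so that a key whose values differ produces two unmatched halves, one carrying table1's values and one carrying table2's, which gives the two-row-per-mismatch shape in your expected output. If your real tables have more value columns than age, the same pattern extends by adding each column to the join condition. In dialects with a null-safe comparison (e.g. "is distinct from" in Postgres, or the <=> operator in Hive) you can instead join on the key alone and filter on the value columns; a rough sketch, using a hypothetical extra column city:

/* mismatching rows with several value columns:
   join on the key only, then flag any column pair that differs
   (is distinct from treats NULLs as comparable values) */
select t1.name,
       t1.age  as age_table1,  t2.age  as age_table2,
       t1.city as city_table1, t2.city as city_table2
from table1 t1
  join table2 t2
    on t1.name = t2.name
where t1.age  is distinct from t2.age
   or t1.city is distinct from t2.city;

This variant returns one row per mismatched key with both tables' values side by side, rather than two rows per mismatch.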

Upvotes: 1
