Reputation: 23
I need to compare 2 databases on their datas. But I have to compare only the data, not the keys, which may be different (but the relations between objets must be the same).
Actually it is a need to verify the proper functioning of a data conversion tool.
I use an entity-framework on C# .Net and Sqlite, here's the database schema:
public partial class Document
{
[PrimaryKey, AutoIncrement]
public Int64 idDocument { get; set; }
public string name { get; set; }
}
public partial class Diagram
{
[PrimaryKey, AutoIncrement]
public Int64 idDiagram { get; set; }
public string title { get; set; }
}
public partial class Document_contains_diagram
{
public Int64 idDocument { get; set; }
public Int64 idDiagram { get; set; }
}
The database :
Tables in order : Document - Diagram - Document_contains_diagram
idDocument | name | idDiagram | title | idDocument | idDiagram | ||
---|---|---|---|---|---|---|---|
1 | "myFirstDocument" | 1 | "evolution" | 1 | 1 | ||
2 | "mySecondDocument" | 2 | "futur" | 2 | 2 |
Should be considered as same as:
idDocument | name | idDiagram | title | idDocument | idDiagram | ||
---|---|---|---|---|---|---|---|
3 | "myFirstDocument" | 8 | "evolution" | 3 | 8 | ||
4 | "mySecondDocument" | 6 | "futur" | 4 | 6 |
It's an example, the true database I want to compare contains more than 50 tables and often more than 10000 entries (RAM limit is a contraint too).
So I'm looking for an generic algorythm, or a tool to compare theses 2 databases that might help me in my search. All tools I've seen only check for strict equality of the tables. Maybe converting the database into a graph would works ?
Upvotes: 0
Views: 886
Reputation: 46507
I don't know of an automated tool to do this, but I would recommend building a one-off solution for finding a canonical arrangement of the data. Then comparing canonical versions.
What I mean is this. You'd export all of the data, and import it into SQLite. Then for each table you define a canonical order. You then migrate IDs into canonical ones, then migrate all of the IDs. Repeat.
Much of the structure can be automatically generated. But I'd recommend automatically generating a configuration that you then fix for the weird edge cases that always exist.
In your case the configuration might look something like this:
[
# table, column, orderBy, foreign references
# each order by is [column list]
# each foreign reference is [table, column]
["Document", "idDocument", ["name"], [["Document_contains_diagram", "idDocument"]]],
["Diagram.idDiagram", ["title"], [["Document_contains_diagram", "idDiagram"]]],
...
]
What your tool would do with it is:
For each entry:
Create a conversion table with a query like:
create table tmp_convert as (
select {table}.{id_column} as old_id,
row_number() over (order by {columns here}) as new_id
from {table}
);
Add an index to tmp_convert.old_id
Update {table}.{id_column} using tmp_convert
for each foreign reference:
Update foreign reference using tmp_convert
drop tmp_convert
So in your example it would take the first database, and after a bunch of work find that it was in canonical order. It would take the second database, and for idDocument
would replace 3 with 1 everywhere, and 4 with 2. For idDiagram
it would replace 8 with 1 and 6 with 2. Once it did this, the two databases are now identical, so they would match.
Upvotes: 0