recommended techniques for merging database records

Question

I am working on a system where database records are periodically created based on an input stream of data. Occasionally some input comes along that provides evidence that two independently created records should be merged into one. I am looking for recommendations on ways to effect the merge in the database.

The main table (which is merely a design at this point) contains records consisting of a unique ID (call it the main ID, which is assigned by the database, MySQL in my system), and some data fields. There are also some other tables that use the main ID to link their records to a record in the main table.

MainTable:
int   mainID
blob  data
...

OtherTable:
int   otherID
int   mainID
blob  otherData
...

Now if each record has never been shared to any external process or system, it is straightforward to somehow blend the data fields from one record into another and delete the record for the one. It is also straightforward (if tedious and/or inefficient) to update the main ID fields in the other tables to the main ID value we are keeping.

Things get complicated when the ID for each record has been shared outside the system. In this case, I think it should be unreasonable to have queries with those deleted IDs simply fail, though I could be convinced otherwise.

An idea I am considering is to introduce a merge table with two key fields: and original main ID and a current main ID. Its purpose is to alias one main ID to another. As each main table record is created, we add a record to the merge table mapping the main ID of the newly created main table record to itself. If a merge occurs, we simply update the current main ID field in the merge table for the record with the original main ID for the main record that is being merged away. Then, for every query based on a main ID, we map that ID through the merge table to find the effective main ID we should really use.

MergeTable:
int   mergeID
int   originalMainID
int   currentMainID

Is this a good technique? Can the mapping be done seamlessly in SQL queries? Are there standard or better techniques I should be considering instead?

In doing research on this topic I found surprisingly few examples of this. This question is close, but the merge scenario is different from mine, or so it seems to me. I know a bit about databases, but am by no means an expert, so I probably don't know the right terms to search for.

Marcus Adams · Accepted Answer

I like your design idea, but consider one where you store only replaced records in your merge table, not all of them. This reduces storage and improves speed, given the following query:

SELECT *
  FROM MainTable
  WHERE mainID = 1
UNION ALL
SELECT MainTable.*
  FROM MergeTable
  INNER JOIN MainTable
    ON MainTable.mainID = MergeTable.currentMainID
  WHERE MergeTable.originalMainID = 1
LIMIT 1

The idea is that in most cases, the first query will succeed and return a result, and MySQL will abort the second query since the LIMIT is fulfilled. If the first query returns no results, then it will proceed to the second query and perform the join on the merge table to see if it's been merged.

According to MySQL, regarding LIMIT:

As soon as MySQL has sent the required number of rows to the client, it aborts the query unless you are using SQL_CALC_FOUND_ROWS.

If merged records are the exception, not the rule, then this will save many, many joins.

You could also do this with two queries if the UNION query is too scary. You could simply check to see if the record exists, and if not, then check the merge table.

recommended techniques for merging database records

Answers (1)

Related Questions