Scott M. Stolz
Scott M. Stolz

Reputation: 774

MySQL Import: Preventing duplicates when importing records that have no primary key

We receive data from a third party and unfortunately, that data contains unique records but does not contain a unique key. We need to import this information into a MySQL database, preferably using PHP, but need to make sure that rows are not duplicated.

The entries never change once they've been generated, as they are time-stamped snapshots of balances in accounts.

When the dataset was smaller, we simply just refreshed the entire database by deleting all of the data and the re-importing the whole thing again. As the dataset grows in size, this is not a sustainable solution as it uses too much system resources and takes more and more time.

First, we dump the imported data into a temporary MySQL table (source data has no primary key):

Table Structure: Imported Data

    BalanceDate DATE NOT NULL COMMENT 'Date Balance Was Fetched From Bank',
    BalanceTime TIME NOT NULL COMMENT 'Time Balance Was Fetched from Bank',
    AccountName VARCHAR(100) DEFAULT NULL COMMENT 'Name of Account Downloaded from Source',
    AccountNo VARCHAR(50) DEFAULT NULL COMMENT 'Account Number Downloaded from Source',
    InstName VARCHAR(150) DEFAULT NULL COMMENT 'Financial Institution Name Downloaded from Source',
    Balance DECIMAL(10, 2) NOT NULL COMMENT 'Balance in Account',
    Type VARCHAR(50) NOT NULL COMMENT 'Type of Account',
    Class VARCHAR(50) NOT NULL COMMENT 'Class of Account',
    Index INT(11) DEFAULT NULL COMMENT 'Index Number Downloaded from Source',

Note, the index number appears to be unique to each financial institution but is not unique to each entry.

The live table has the above fields plus an autoincremented unique ID field. We've also added some columns for foreign keys which we convert with a simple IF THEN statement.

Table Structure: Additional Fields in Live Table

    ID INT(11) NOT NULL AUTO_INCREMENT,
    AccountID INT(11) NOT NULL COMMENT 'Linked to Accounts table (Account at Financial Institution)',
    InstID INT(11) NOT NULL COMMENT 'Links to Institution table (Financial Institution)',

PHP

    // Accounts and Banks and prepopulated into the database in other tables.
    // This is a simplified example of the logic. Actual code a bit more complicated.
    if ($AccountName == "Whatever Account" AND $InstName == "Whatever Bank") {
       $AccountID = 1;
       $InstID = 1;
    }

The goal is to transfer the data to our database and not have entries duplicated in the process.

Basically, if the whole row already exists in the live database, don't import a second copy of it.

Upvotes: 1

Views: 1125

Answers (1)

Barmar
Barmar

Reputation: 781130

Add a virtual column to the table that contains a hash of all the other columns, and specify it UNIQUE

HashCode CHAR(32) AS MD5(CONCAT_WS(',', BalanceDate, BalanceTime, AccountName, AccountNo, ...) UNIQUE

Then when you're inserting from the temporary table, you can use INSERT IGNORE. If any of the incoming rows hash the same as one of the existing rows, the index on HashCode will cause it to be skipped.

This will slow down the import process, since every new row will have to be hashed and checked against the index. There's no such thing as a free lunch, and it won't require as much storage as indexing the entire row.

Upvotes: 1

Related Questions