Radical_Activity

Reputation: 2738

How to avoid inserting duplicates into MySQL?

I have an analytics platform with lots of users and hundreds of click inserts per minute.

Sometimes I see that the exact same click is inserted into the database twice within the same second, so one row is a duplicate of the other.

I have a check that looks for an existing row with the same values and skips the insert if it finds one.

However, in this case it looks to me like both rows are inserted into the DB in the exact same millisecond, so the check does not catch them.

What can I do here?

Upvotes: 1

Views: 107

Answers (2)

Drew

Reputation: 24970

My favorite: insert ignore myTable (col1, col2, ...) ...

where unique key(s) are set up beforehand to forbid the duplicate insert. It would appear that you do not care so much whether the row was previously inserted as that the end result has no dupes.

Note: the unique keys can be multi-column keys (composites).

A word of warning about insert ignore: it should not be used without careful thought about its ramifications in sensitive systems that need to know whether the row was truly already there. It is ideal for "make sure it is there".
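As a rough illustration of the unique-key-plus-ignore approach, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; SQLite spells the statement `INSERT OR IGNORE`, while the MySQL form is shown in the comment. The `clicks` table, its columns, and the `record_click` helper are hypothetical names chosen for the example.

```python
import sqlite3

# SQLite stands in for MySQL here. The MySQL statement would be:
#   INSERT IGNORE INTO clicks (user_id, link_id, clicked_at) VALUES (%s, %s, %s)
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE clicks (
        user_id    INTEGER NOT NULL,
        link_id    INTEGER NOT NULL,
        clicked_at TEXT    NOT NULL,
        UNIQUE (user_id, link_id, clicked_at)  -- composite unique key
    )
    """
)


def record_click(user_id, link_id, clicked_at):
    """Insert a click; return True if a row was added, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO clicks (user_id, link_id, clicked_at) VALUES (?, ?, ?)",
        (user_id, link_id, clicked_at),
    )
    # An ignored insert changes no rows, so rowcount is 0 for a duplicate.
    return cur.rowcount == 1


first = record_click(1, 42, "2016-05-01 12:00:00")
second = record_click(1, 42, "2016-05-01 12:00:00")  # the same click again
total = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
```

Because the database itself enforces the unique key, the second call is silently dropped no matter how close together the two inserts arrive.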

Option B: One could look into intention locks, crafted for your particular use case. Steer toward InnoDB row-level locking, which is fast, and certainly away from table locks. Most things come with a trade-off; the downside of locking is diminished concurrency.

Option C: For the faint of heart (sometimes me), and this is what I would do if hired out and wishing to avoid peer backlash later: perform an Insert ... on Duplicate Key Update (IODKU), with a bogus column such as touches, an int that you increment in the update part of the IODKU. Example below:

insert myTable (col1, col2, col3) values (p1,p2,p3) 
on duplicate key update touches=touches+1;

The above is the most minimal form. Below is a view of what I use in C#, where I care about more columns in the update part of the IODKU, in case it benefits anyone:

[Image: screenshot of the C# version of the IODKU statement]

A final thought on IODKU: it is mandatory to have a unique key (primary or just unique) that causes the "clash" to occur. That is how the statement knows whether to perform the insert or the update. Without such a unique key clash, a new row is always inserted.
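To make the touches counter concrete, here is a small runnable sketch, again using Python's sqlite3 as a stand-in for MySQL. SQLite (3.24 or newer) writes the upsert as `ON CONFLICT ... DO UPDATE`; the MySQL form from the answer is kept in the comment. Which columns make up the unique key is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE myTable (
        col1    INTEGER NOT NULL,
        col2    INTEGER NOT NULL,
        col3    TEXT,
        touches INTEGER NOT NULL DEFAULT 0,
        UNIQUE (col1, col2)  -- assumed key; this is what makes the clash happen
    )
    """
)

# MySQL form, as in the answer:
#   INSERT INTO myTable (col1, col2, col3) VALUES (%s, %s, %s)
#   ON DUPLICATE KEY UPDATE touches = touches + 1
# SQLite equivalent used here (needs SQLite 3.24+):
UPSERT = """
    INSERT INTO myTable (col1, col2, col3) VALUES (?, ?, ?)
    ON CONFLICT (col1, col2) DO UPDATE SET touches = touches + 1
"""

# The first execution inserts the row; the next two hit the unique key
# and only bump the counter.
for _ in range(3):
    conn.execute(UPSERT, (1, 2, "a"))

rows = conn.execute("SELECT col1, col2, col3, touches FROM myTable").fetchall()
```

After three attempts there is still one row, and its touches value records that two of them were duplicates.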

Back to the OP's issue: your system probably ended up with duplicate rows because of high-concurrency use without locking. Two requests can both run the existence check before either one has inserted.
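The race can be reproduced deterministically by interleaving the two steps by hand. This is only a simulation on a single sqlite3 connection, not real concurrent MySQL traffic, and the `clicks` table and helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No unique key: the application-level check is the only protection.
conn.execute("CREATE TABLE clicks (click_id TEXT NOT NULL)")


def exists(click_id):
    """Application-level duplicate check."""
    row = conn.execute(
        "SELECT 1 FROM clicks WHERE click_id = ?", (click_id,)
    ).fetchone()
    return row is not None


def insert(click_id):
    conn.execute("INSERT INTO clicks (click_id) VALUES (?)", (click_id,))


# Two requests, A and B, carry the same click. Both run their check
# before either has inserted, so both conclude the row is missing.
a_sees_row = exists("c-1")
b_sees_row = exists("c-1")

if not a_sees_row:
    insert("c-1")
if not b_sees_row:
    insert("c-1")

copies = conn.execute(
    "SELECT COUNT(*) FROM clicks WHERE click_id = 'c-1'"
).fetchone()[0]
```

Both checks pass and the table ends up with two copies, which is why the check has to be enforced by the database (a unique key) or serialized with a lock.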

Upvotes: 2

olegsv

Reputation: 1462

If the system's architecture allows it, I would create a two-tier solution. First, a temporary table into which all data, duplicates included, is inserted. The temporary table's name can contain a sharding parameter, for example the hour number. The system periodically exports data from the temporary tables into the main storage table, discarding duplicates along the way, and then drops the temporary tables.
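A minimal sketch of this two-tier idea, using Python's sqlite3 as a stand-in for MySQL. The table names, the per-hour naming scheme, and the `stage` / `flush` helpers are all hypothetical, and `SELECT DISTINCT` is just one possible way to discard duplicates during the export.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clicks (user_id INTEGER, link_id INTEGER, clicked_at TEXT)"
)


def staging_name(hour):
    """Per-hour staging table name; hour is an int, so the name is safe to format."""
    return f"clicks_staging_{int(hour):02d}"


def stage(hour, rows):
    """Write raw clicks, duplicates and all, into that hour's staging table."""
    table = staging_name(hour)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(user_id INTEGER, link_id INTEGER, clicked_at TEXT)"
    )
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)


def flush(hour):
    """Export de-duplicated rows into main storage, then drop the staging table."""
    table = staging_name(hour)
    conn.execute(
        "INSERT INTO clicks "
        f"SELECT DISTINCT user_id, link_id, clicked_at FROM {table}"
    )
    conn.execute(f"DROP TABLE {table}")


stage(12, [
    (1, 42, "2016-05-01 12:00:00"),
    (1, 42, "2016-05-01 12:00:00"),  # duplicate of the row above
    (2, 42, "2016-05-01 12:00:01"),
])
flush(12)

main_count = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
staging_left = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE name = ?", (staging_name(12),)
).fetchone()[0]
```

This only removes duplicates within one staging table; if the same click could land in two different hours, the main table would still need a unique key or a check during the export.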

Upvotes: 0
