paullb

Reputation: 4335

Maintaining a list of unique values in a database

Let's say you have a random number generator spitting out numbers between 1 and 100,000,000, and you want to store them in a database (MySQL) with the timestamp when they were generated. If a number that has previously been seen arrives, it is discarded.

What would be the best algorithm to make this happen? SELECT then INSERT as necessary? Is there something more efficient?

Upvotes: 0

Views: 214

Answers (4)

CAFxX

Reputation: 30341

If you put a UNIQUE index on the column with the extracted numbers, any INSERT attempting to duplicate a UNIQUE key will fail.

Therefore the easiest and most portable version will be (PHP code, but you get the idea):

function extraction() {
  do {
    // Keep generating until an INSERT succeeds; the UNIQUE index on
    // `number` makes the INSERT fail for values already in the table.
    $random = generate_random_number();
    $result = @mysql_query("INSERT INTO extractions (number) VALUES ($random)");
  } while (!$result);
  return $random;
}

Upvotes: 0

vyegorov

Reputation: 22905

  1. You can go for a SEQUENCE:

    +

    • no relations are being locked, thus best performance;
    • no race conditions;
    • portable.

    -

    • it is possible to get “gaps” in the series of numbers.
  2. You can do a SELECT ... then INSERT ...:

    +

    • no gaps, you can also do some complicated math on your numbers.

    -

    • it's possible for another parallel session to slip in between the SELECT and the INSERT, leaving you with 2 equal numbers;
    • if there's a UNIQUE constraint, the previous situation will lead to an exception;
    • to avoid this situation, you might go for explicit table locks, but that will cause an immediate performance impact.
  3. You can choose INSERT ON DUPLICATE KEY UPDATE, which seems to be the best option (take a look at "INSERT IGNORE" vs "INSERT ... ON DUPLICATE KEY UPDATE"), at least in my view; the only drawback is that it is not portable to other RDBMSes.
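The insert-and-detect-duplicate idea behind option 3 can be sketched in Python using SQLite's `INSERT OR IGNORE` as a rough stand-in for MySQL's duplicate-key handling (the `extractions` table and column names here are made up for illustration):

```python
import sqlite3

# In-memory database standing in for the MySQL table; the UNIQUE
# constraint is what makes duplicate numbers fail silently.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extractions (number INTEGER UNIQUE, ts TEXT)")

def record(number):
    """Insert the number; return True if it was new, False if a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO extractions (number, ts) "
        "VALUES (?, datetime('now'))",
        (number,),
    )
    # rowcount is 0 when the insert was ignored because of a duplicate
    return cur.rowcount == 1

print(record(42))  # True: first time seen
print(record(42))  # False: duplicate, discarded
```

The single statement both inserts and detects the duplicate, so there is no window for a parallel session to sneak in between a SELECT and an INSERT.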

P.S. This article is not related to MySQL, but it is worth reading to get an overview of all the catches that can happen on your way.

Upvotes: 1

Gordon Linoff

Reputation: 1270763

This would normally be solved by creating a unique index on the random number column in the table. You could experiment to see whether a b-tree or a hash index performs better.

If you have lots of memory, you could pre-populate a table with 100,000,000 rows -- all possible values. Then, when you check whether a number has already been seen, you only need to see whether its timestamp is non-null. However, this would require over a gigabyte of RAM to store the table in memory, and would only be the optimal solution if you are trying to maximize transactions per second.
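A minimal sketch of this pre-populated approach, again using SQLite in place of MySQL and a hypothetical `extractions` table (only 3 rows are pre-populated here instead of the 100,000,000 in the answer's scenario):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extractions (number INTEGER PRIMARY KEY, ts TEXT)")
# Pre-populate every possible value with a NULL timestamp.
conn.executemany("INSERT INTO extractions (number) VALUES (?)",
                 [(1,), (2,), (3,)])

def claim(number):
    """Set the timestamp iff it is still NULL; True means the number was fresh."""
    cur = conn.execute(
        "UPDATE extractions SET ts = datetime('now') "
        "WHERE number = ? AND ts IS NULL",
        (number,),
    )
    # 0 affected rows: the number was already claimed (or out of range)
    return cur.rowcount == 1

print(claim(2))  # True: timestamp was NULL, number is fresh
print(claim(2))  # False: already claimed
```

Because the row already exists, the check and the "mark as seen" collapse into a single UPDATE, so no insert ever needs to fail.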

Upvotes: 0

Maksym Polshcha

Reputation: 18368

If you don't need to insert a new random value every time, you can use INSERT IGNORE or REPLACE INTO. Otherwise, you should SELECT to check and then INSERT.
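The two statements behave differently on a duplicate, which matters for the "timestamp of first sighting" requirement. A small sketch using SQLite's equivalents (`INSERT OR IGNORE` and `REPLACE INTO`) on a hypothetical `extractions` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extractions (number INTEGER PRIMARY KEY, ts TEXT)")

# INSERT OR IGNORE keeps the original row (and its timestamp) on conflict...
conn.execute("INSERT OR IGNORE INTO extractions VALUES (5, 'first')")
conn.execute("INSERT OR IGNORE INTO extractions VALUES (5, 'second')")
print(conn.execute("SELECT ts FROM extractions WHERE number = 5")
          .fetchone()[0])  # prints "first"

# ...whereas REPLACE deletes the old row and re-inserts, overwriting it.
conn.execute("REPLACE INTO extractions VALUES (5, 'third')")
print(conn.execute("SELECT ts FROM extractions WHERE number = 5")
          .fetchone()[0])  # prints "third"
```

For the question's "discard repeats" behavior, the IGNORE variant is the right fit, since REPLACE would overwrite the timestamp of the first sighting.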

Upvotes: 1
