Onrain
Onrain

Reputation:

MySQL efficient "select id else insert" query

I have a MySQL table consisting of:

CREATE TABLE `url_list` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `crc32` int(10) unsigned NOT NULL,
  `url` varchar(512) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `crc32` (`crc32`)
);

When inserting data into a related table I need to lookup the primary key from this table, and using the crc32 really speeds that up whilst allowing a small index. The URLs do need to be unique, but I'd like to avoid having more index than actual data.

If the value isn't present I need to insert it, but using structures such as INSERT IGNORE, or ON DUPLICATE KEY either requires me to put a unique on the huge varchar, or don't take advantage of my index.

How can I "SELECT id else INSERT", whilst preserving the lookup speed for the 80-90% of hits that are already in the table?

Upvotes: 2

Views: 1676

Answers (2)

Bill Karwin
Bill Karwin

Reputation: 562358

I would recommend ditching the id column and the crc32 because they're not necessary.

You can use an MD5() hash to provide a fixed-length, virtually unique value computed from the lengthy URL data, and then use that hash as the primary key.

CREATE TABLE `url_list` (
  `url_hash` BINARY(16) NOT NULL PRIMARY KEY
  `url`      VARCHAR(512) NOT NULL
);

DELIM !!
CREATE TRIGGER `url_ins` BEFORE INSERT ON `url_list`
FOR EACH ROW
BEGIN
  SET NEW.`url_hash` = UNHEX( MD5( NEW.`url` ) );
END!!

Then you can use INSERT..ON DUPLICATE KEY UPDATE because unlike crc32, the hash should have a very low chance of collision.

edit: See http://en.wikipedia.org/wiki/Birthday_attack. If you log 1 million distinct URL's per day for 2,000 years, the MD5 hashes of these URL's are still less likely to include a collision than your hard disk is to have an uncorrectable bit error.

Upvotes: 3

Adriano Varoli Piazza
Adriano Varoli Piazza

Reputation: 7429

This website offers a solution to a similar problem.

Upvotes: 1

Related Questions