dyelawn

Reputation: 760

MySQL Database Modeling for large data sets

A client wants to compile data for his customers from a bunch of different sources. I'm building on a PHP/MySQL server architecture. All of my experience is in front-end dev and design, so I'm running into performance problems now that there are a lot of data sets.

The performance problem is de-duplication. The main db table stores domains and has four columns: 'id', 'domain_name', and two booleans used to determine whether or not a domain is a possible target for the customers. There is an INDEX on the 'domain_name' column.
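
Roughly, the table looks like this (the table name and the two boolean column names below are placeholders, only the structure is what I described above):

CREATE TABLE tbl_name (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  domain_name VARCHAR(255) NOT NULL,
  is_target_a TINYINT(1) NOT NULL DEFAULT 0,
  is_target_b TINYINT(1) NOT NULL DEFAULT 0,
  INDEX idx_domain_name (domain_name)
);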

I don't want multiple rows for the same domain. The domains arrive in sets of 30,000, and right now I am using:

if(!(Domain::find_by_domain($d->n))) {
    // insert into db
}

I've also tried:

$already_in_db = Domain::list_domains();
if(!in_array($d->n, $already_in_db)) {
    // insert into db
}

There are only about 170,000 domains in the table right now, and both methods already take an extremely long time.

1) Will setting a UNIQUE INDEX on the 'domain_name' column cause dupes to just be discarded?

2) Are there any other methods to speed up this process?

Upvotes: 2

Views: 277

Answers (1)

eggyal

Reputation: 126025

Make your index on the domain name column UNIQUE; your INSERT statements will then fail if the domain already exists (you can use REPLACE or INSERT ... ON DUPLICATE KEY UPDATE if you want to change the existing row in the event of such a collision):

ALTER TABLE tbl_name
  DROP INDEX name_of_existing_index,
  ADD  UNIQUE INDEX name_of_existing_index (domain_name);
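
For example, once the unique index is in place, each batch of 30,000 can be sent as a few multi-row statements instead of one query per domain. This is only a sketch; the two boolean column names are placeholders, since the question doesn't name them:

INSERT INTO tbl_name (domain_name, is_target_a, is_target_b)
VALUES
  ('example.com', 1, 0),
  ('example.org', 0, 1)
  -- ... remaining rows from the batch ...
ON DUPLICATE KEY UPDATE
  is_target_a = VALUES(is_target_a),
  is_target_b = VALUES(is_target_b);

REPLACE would also work here, but it deletes and re-inserts the row on a collision (so the auto-increment id changes), whereas ON DUPLICATE KEY UPDATE only updates the listed columns.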

Upvotes: 2
