Reputation: 89
I want to compare two product database based on title,
I first data is about 3 million from which I want to compare and the second data is 10 million, I am doing this because to remove duplicate products.
I have tried this by using MySQL query writing program in PHP which check title (name = '$name') if the data will return zero so it will be unique but it is quite slow 2 sec per result.
The second method I have used is storing data in the text file and using the regular expression, but it will also slow.
What is the best way to compare large data to find out unique products.?
Table DDL:
CREATE TABLE main ( id int(11) NOT NULL AUTO_INCREMENT,
name text,
image text, price int(11) DEFAULT NULL,
store_link text,
status int(11) NOT NULL,
cat text NOT NULL,
store_single text,
brand text,
imagestatus int(11) DEFAULT NULL,
time text,
PRIMARY KEY (id) )
ENGINE=InnoDB AUTO_INCREMENT=9250887
DEFAULT CHARSET=latin1;
Upvotes: 0
Views: 339
Reputation: 89
I have tried a lot using MySQL queries but data was very slow, only find out the solution is using sphinx, Index whole database and searching for every product string on sphinx index and same time removing duplicate products getting ids from sphinx.
Upvotes: 0
Reputation: 1496
Upvotes: 0
Reputation: 296
I have been using SQLyog to compare databases of around 1-2 million data. It gives an option for "One-way synchronization","Two-way synchronization" and also "Visually merge data" to sync the databases.
The important part is,it gives an option to compare data on chunks, and this value can be specified by us in writing the chunk limit inorder to avoid connection loss.
Upvotes: 0
Reputation: 723
Since you have to go over 10 mio titles 3 mio times its going to take some time. My approach would be to see if you can get all titles from both lists in a php script. Then compare them there in memory. Have the script create delete statements to a text file which you then execute on the db.
Not in your question but probably you next problem: different spellings see
similar_text()
soundex()
levenshtein()
for some help with that.
Upvotes: 2
Reputation: 2787
In my opinion this is what database are made for. I wouldn't reinvent the wheel in your shoes.
Once this is agreed, you should really check database structure and indexing to speed up your operations.
Upvotes: 0