junjoi

Reputation: 89

Compare two big datasets - 20 million products

I want to compare two product databases based on title.

The first dataset is about 3 million products and the second is about 10 million; I want to compare the first against the second in order to remove duplicate products.

I first tried a PHP program that issues one MySQL query per title (WHERE name = '$name'); if the query returns zero rows, the product is unique. It works, but it is quite slow: about 2 seconds per result.
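
Roughly, the check looks like this (a reconstruction of what I described; the credentials and the name of the 10-million-row table are placeholders):

    <?php
    // Reconstruction of the per-title check: one query per product,
    // which is what makes it slow. Credentials and the table name
    // `big_table` are placeholders.
    $db = new mysqli('localhost', 'user', 'pass', 'shop');

    $name = 'Some product title';   // one of the 3 million titles
    $stmt = $db->prepare('SELECT COUNT(*) FROM big_table WHERE name = ?');
    $stmt->bind_param('s', $name);
    $stmt->execute();
    $stmt->bind_result($count);
    $stmt->fetch();
    $stmt->close();

    if ($count === 0) {
        // no match among the 10 million rows, so this product is unique
    }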

The second method I tried was dumping the data to a text file and matching titles with regular expressions, but that was slow as well.

What is the best way to compare such large datasets and find the unique products?

Table DDL:

CREATE TABLE main (
    id int(11) NOT NULL AUTO_INCREMENT,
    name text,
    image text,
    price int(11) DEFAULT NULL,
    store_link text,
    status int(11) NOT NULL,
    cat text NOT NULL,
    store_single text,
    brand text,
    imagestatus int(11) DEFAULT NULL,
    time text,
    PRIMARY KEY (id)
) ENGINE=InnoDB AUTO_INCREMENT=9250887 DEFAULT CHARSET=latin1;

Upvotes: 0

Views: 339

Answers (5)

junjoi

Reputation: 89

I tried a lot with plain MySQL queries but it was very slow. The only solution I found was Sphinx: index the whole database, search the Sphinx index for every product title, and at the same time remove the duplicate products using the ids Sphinx returns.
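
A minimal sketch of that setup, assuming a running searchd instance with a full-text index over the titles named `products` in sphinx.conf (the index name is invented). SphinxQL typically listens on port 9306 and speaks the MySQL protocol, so plain mysqli works as the client:

    <?php
    // Query the Sphinx index for every title; any hit with a different
    // id is a duplicate. Connection details are placeholders.
    $sphinx = new mysqli('127.0.0.1', '', '', '', 9306); // SphinxQL
    $db     = new mysqli('localhost', 'user', 'pass', 'shop');

    $rows = $db->query('SELECT id, name FROM main', MYSQLI_USE_RESULT);
    while ($row = $rows->fetch_row()) {
        // Phrase-search the whole title; titles containing Sphinx query
        // syntax would need extra escaping beyond this.
        $match = $sphinx->real_escape_string('"' . $row[1] . '"');
        $hits  = $sphinx->query("SELECT id FROM products WHERE MATCH('$match')");
        while ($hit = $hits->fetch_row()) {
            if ((int)$hit[0] !== (int)$row[0]) {
                echo $hit[0], "\n"; // id of a duplicate product
            }
        }
    }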

Upvotes: 0

RojoSam

Reputation: 1496

  1. If your DB supports it, use a LEFT JOIN and filter the rows where the right side is not null; those are the duplicates. But first create indexes on your key column (name) in both tables (see the sketch after this list).
  2. If your computer/server has enough memory to hold the 3 million objects in a HashSet, build a HashSet keyed on NAME, then read the other set (10 million objects) one by one and test whether each object exists in the HashSet. If it exists, it is a duplicate. (I would suggest dumping the data to text files and then reading the files to build the structure.)
  3. If the previous strategies fail, then it is time to implement some kind of MapReduce. You can approximate it by applying one of the previous approaches to subsets of your data, for example comparing all the products whose titles start with the same letter.
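
A minimal sketch of option 1 against the question's schema, assuming the 10 million rows live in a second table with the same layout (called `incoming` here, an invented name). Because `name` is a TEXT column, MySQL requires a prefix length on any index over it; 100 characters is an arbitrary choice:

    <?php
    // Option 1 as a sketch: index both title columns, then LEFT JOIN.
    // Connection details and the table name `incoming` are placeholders.
    $db = new mysqli('localhost', 'user', 'pass', 'shop');

    // `name` is TEXT, so the index needs a prefix length (100 here).
    $db->query('ALTER TABLE main ADD INDEX idx_name (name(100))');
    $db->query('ALTER TABLE incoming ADD INDEX idx_name (name(100))');

    // Titles in `incoming` that also exist in `main` are duplicates.
    $res = $db->query(
        'SELECT i.id
           FROM incoming i
           LEFT JOIN main m ON m.name = i.name
          WHERE m.id IS NOT NULL'
    );
    while ($row = $res->fetch_row()) {
        echo $row[0], "\n"; // id of a duplicated product
    }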

Upvotes: 0

Mathew

Reputation: 296

I have been using SQLyog to compare databases of around 1-2 million rows. It offers "One-way synchronization", "Two-way synchronization", and also "Visually merge data" to sync the databases.

The important part is that it can compare the data in chunks, and you can set the chunk limit yourself in order to avoid connection loss.

Upvotes: 0

snitch182

Reputation: 723

Since you have to check 3 million titles against 10 million, this is going to take some time no matter what. My approach would be to fetch all titles from both lists into a PHP script and compare them there, in memory. Have the script write DELETE statements to a text file, which you then execute on the DB, as sketched below.
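
A sketch of that script, assuming the two title lists live in tables `main` and `incoming` (the second name is invented) and that memory_limit allows holding roughly 3 million titles in a PHP array:

    <?php
    // In-memory comparison: load the small list into a PHP hash table,
    // stream the big list past it, and emit DELETE statements to a
    // file. Connection details and table names are placeholders.
    $db = new mysqli('localhost', 'user', 'pass', 'shop');

    // Load the smaller title list (~3 million) into a PHP array.
    $titles = [];
    $res = $db->query('SELECT name FROM main', MYSQLI_USE_RESULT); // unbuffered
    while ($row = $res->fetch_row()) {
        $titles[$row[0]] = true;
    }
    $res->close();

    // Stream the big table and write DELETEs for every duplicate title.
    $out = fopen('deletes.sql', 'w');
    $res = $db->query('SELECT id, name FROM incoming', MYSQLI_USE_RESULT);
    while ($row = $res->fetch_row()) {
        if (isset($titles[$row[1]])) {
            fwrite($out, 'DELETE FROM incoming WHERE id = ' . (int)$row[0] . ";\n");
        }
    }
    fclose($out);
    $res->close();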

Not in your question, but probably your next problem: different spellings. See

   similar_text()
   soundex()
   levenshtein()

for some help with that.
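
For instance, a toy comparison of two spellings of the same product; the thresholds are arbitrary examples, not tuned values:

    <?php
    // Fuzzy-matching two near-duplicate titles with PHP built-ins.
    $a = 'Samsung Galaxy S5 16GB Black';
    $b = 'Samsung Galaxy S5 16 GB (Black)';

    similar_text($a, $b, $percent);             // similarity in percent
    $distance    = levenshtein($a, $b);         // single-character edits
    $soundsAlike = soundex($a) === soundex($b); // crude phonetic match

    if ($percent > 90 || $distance <= 5 || $soundsAlike) {
        echo "Probably the same product\n";
    }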

Upvotes: 2

Diego Ferri

Reputation: 2787

In my opinion this is what databases are made for. I wouldn't reinvent the wheel in your shoes.

Once that is agreed, you should really review your database structure and indexing to speed up your operations.
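
For example (a sketch only): since `name` is declared TEXT, it can only be indexed with a prefix such as name(100), and once the data sits in one table a single GROUP BY will surface the duplicated titles:

    <?php
    // Sketch: let MySQL itself report duplicated titles.
    // Connection details are placeholders.
    $db = new mysqli('localhost', 'user', 'pass', 'shop');

    $res = $db->query(
        'SELECT name, COUNT(*) AS copies
           FROM main
          GROUP BY name
         HAVING copies > 1'
    );
    while ($row = $res->fetch_assoc()) {
        printf("%s appears %d times\n", $row['name'], $row['copies']);
    }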

Upvotes: 0
