Reputation: 31
Let's say you have a large text file. Each row contains an email id and some other information (say, a product-id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e. eliminate the duplicates)?
Upvotes: 3
Views: 3630
Reputation: 1606
Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record-linkage engine written in Java. It uses Lucene to index the data and reduce the number of comparisons (to avoid the unacceptable Cartesian-product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and is extremely extensible and configurable.
Upvotes: 1
Reputation: 26882
Oh, and if I were you, I would put a unique constraint on the DB anyway...
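A minimal sketch of how such a constraint catches duplicates at insert time, using plain JDBC. The in-memory H2 URL and the table/column names (emails, email_id, product_id) are assumptions for illustration, not from the original answer:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;
import java.sql.Statement;

public class UniqueConstraintDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical in-memory H2 database; swap in your own JDBC URL/driver.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:dedup")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE emails (email_id VARCHAR(255), product_id VARCHAR(64), "
                         + "CONSTRAINT uq_email UNIQUE (email_id))");
            }
            String sql = "INSERT INTO emails (email_id, product_id) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                insertIgnoringDups(ps, "a@example.com", "p1"); // inserted
                insertIgnoringDups(ps, "a@example.com", "p2"); // rejected by the constraint
            }
        }
    }

    static void insertIgnoringDups(PreparedStatement ps, String email, String productId)
            throws Exception {
        try {
            ps.setString(1, email);
            ps.setString(2, productId);
            ps.executeUpdate();
        } catch (SQLIntegrityConstraintViolationException dup) {
            // Duplicate key: skip the row instead of failing the whole load.
        }
    }
}
```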
Upvotes: 8
Reputation: 33092
Your problem can be solved with an Extract, Transform, Load (ETL) approach: extract the rows from the file, de-duplicate them in the transform step, and load the clean records into the database. You can do this manually or use an ETL tool.
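One common way to run the transform step inside the database itself is a staging table: bulk-load the raw file, then copy only distinct keys into the target. A sketch under assumed table/column names and an assumed H2 connection, neither of which comes from the original answer:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EtlDedup {
    public static void main(String[] args) throws Exception {
        // Assumed JDBC URL; replace with your database.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:etl");
             Statement st = conn.createStatement()) {
            // Extract: bulk-load the raw file into the staging table
            // (e.g. COPY, LOAD DATA INFILE, or a batched insert loop).
            st.execute("CREATE TABLE staging (email_id VARCHAR(255), product_id VARCHAR(64))");
            st.execute("CREATE TABLE target  (email_id VARCHAR(255), product_id VARCHAR(64))");

            // Transform + Load: keep one row per email id while copying.
            st.execute("INSERT INTO target (email_id, product_id) "
                     + "SELECT email_id, MIN(product_id) "
                     + "FROM staging GROUP BY email_id");
        }
    }
}
```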
Upvotes: 0
Reputation: 64026
Can you not index the table by email and product ID? Then, reading by index, duplicates of either email or email+prodId are readily identified via sequential reads, simply comparing each record to the previous one.
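The same sort-then-scan idea also works before the data ever reaches the database: order the rows by the key and a single pass finds the duplicates, since they become adjacent. A minimal sketch of that variant (not the indexed-table version the answer describes), assuming comma-separated "email,product-id" lines in a hypothetical input.csv small enough to sort in memory:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

public class SortedScanDedup {
    public static void main(String[] args) throws IOException {
        // Sort rows so duplicates become adjacent, then keep a row only
        // when its key differs from the previous row's key.
        List<String> rows = Files.readAllLines(Path.of("input.csv"));
        Collections.sort(rows); // the email id is the line's prefix

        String previousKey = null;
        for (String row : rows) {
            String key = row.substring(0, row.indexOf(',')); // email id (assumes well-formed lines)
            if (!key.equals(previousKey)) {
                System.out.println(row); // first occurrence: keep it
            }
            previousKey = key;
        }
    }
}
```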
Upvotes: 0
Reputation: 39485
You have two options:

1. Do it in Java: put together something like a HashSet for the membership test, adding each email id as it comes in if it isn't already in the set (see the sketch after this list).

2. Do it in the database: put a unique constraint on the table, so that dups will not be added. An added bonus is that you can repeat the process and remove dups from previous runs.
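A minimal sketch of option 1, assuming the email id is the first comma-separated field of each line in a hypothetical input.csv:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class HashSetDedup {
    public static void main(String[] args) throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(Path.of("input.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String emailId = line.substring(0, line.indexOf(','));
                // add() returns false if the id was already in the set,
                // so duplicates are filtered in a single pass.
                if (seen.add(emailId)) {
                    // First time we see this email id: keep/insert the row.
                    System.out.println(line);
                }
            }
        }
    }
}
```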
Upvotes: 1
Reputation: 3526
I'll start with the obvious answer. Make a HashMap with the email id as the key and the rest of the information as the value (or make an object to hold all the information). When you get to a new line, check whether the key already exists; if it does, move on to the next line. At the end, write out all your SQL statements using the HashMap. I do agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.
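A compact sketch of this approach; the line format, file name, and output SQL are assumptions, and for truly huge files the map itself may not fit in the heap:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class HashMapDedup {
    public static void main(String[] args) throws IOException {
        // Key: email id; value: the rest of the row.
        // putIfAbsent keeps only the first occurrence of each email id.
        Map<String, String> byEmail = new LinkedHashMap<>();
        for (String line : Files.readAllLines(Path.of("input.csv"))) {
            int comma = line.indexOf(',');
            byEmail.putIfAbsent(line.substring(0, comma), line.substring(comma + 1));
        }
        // At the end, emit one statement per unique email id. Real code
        // should use a PreparedStatement rather than string interpolation.
        for (Map.Entry<String, String> e : byEmail.entrySet()) {
            System.out.printf("INSERT INTO emails (email_id, product_id) VALUES ('%s', '%s');%n",
                    e.getKey(), e.getValue());
        }
    }
}
```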
Upvotes: 1