Reputation: 2994

Removing duplicates from a huge file

We have a huge chunk of data and we want to perform a few operations on them. Removing duplicates is one of the main operations.

Ex.

a,me,123,2631272164
yrw,wq,1237,123712,126128361
yrw,dsfswq,1323237,12xcvcx3712,1sd26128361

These are three entries in a file and we want to remove duplicates on the basis of 1st column. So, 3rd row should be deleted. Each row may have different number of columns but the column we are interested into, will always be present.

In memory operation doesn't look feasible.

Another option is to store the data in database and removing duplicates from there but it's again not a trivial task. What design should I follow to dump data into database and removing duplicates?

I am assuming that people must have faced such issues and solved it.

How do we usually solve this problem?

PS: Please consider this as a real life problem rather than interview question ;)

Upvotes: 1

Answers (5)

Vatine

Reputation: 21288

If you need to preserve order in your original data, it MAY be sensible to create new data that is a tuple of position and data, then sort on the data you want to de-dup. Once you've sorted by data, de-duplication is (essentially) a linear scan. After that, you can re-create the original order by sorting on the position-part of the tuple, then strip it off.

Say you have the following data: a, c, a, b

With a pos/data tuple, sorted by data, we end up with: 0/a, 2/a, 3/b, 1/c

We can then de-duplicate, trivially being able to choose either the first or last entry to keep (we can also, with a bit more memory consumption, keep another) and get: 0/a, 3/b, 1/c.

We then sort by position and strip that: a, c, b

This would involve three linear scans over the data set and two sorting steps.

Upvotes: 0

JustinDanielson

Reputation: 3185

If the number of keys is also infeasible to load into memory, you'll have to do a Stable(order preserving) External Merge Sort to sort the data and then a linear scan to do duplicate removal. Or you could modify the external merge sort to provide duplicate elimination when merging sorted runs.

I guess since this isn't an interview question or efficiency/elegance seems to not be an issue(?). Write a hack python script that creates 1 table with the first field as the primary key. Parse this file and just insert the records into the database, wrap the insert into a try except statement. Then preform a select * on the table, parse the data and write it back to a file line by line.

Upvotes: 5

Dan D.

Reputation: 74675

If the input is sorted or can be sorted, then one could do this which only needs to store one value in memory:

r = read_row()
if r is None:
    os.exit()
last = r[0]
write_row(r)
while True:
    r = read_row()
    if r is None:
        os.exit()
    if r[0] != last:
        write_row(r)
        last = r[0]

Otherwise:

What I'd do is keep a set of the first column values that I have already seen and drop the row if it is in that set.

S = set()
while True:
    r = read_row()
    if r is None:
       os.exit()
    if r[0] not in S:
       write_row(r)
       S.add(r[0])

This will stream over the input using only memory proportional to the size of the set of values from the first column.

Upvotes: 0

Ben Neill

Reputation: 3009

If you go down the database route, you can load the csv into a database and use 'duplicate key update'

using mysql:-

Create a table with rows to match your data (you may be able to get away with just 2 rows - id and data)
dump the data using something along the lines of

LOAD DATA LOCAL infile "rs.txt" REPLACE INTO TABLE data_table FIELDS TERMINATED BY ',';
You should then be able to dump out the data back into csv format without duplicates.

Upvotes: 2

Joachim Isaksson

Reputation: 181047

If the number of unique keys aren't extremely high, you could simply just do this;
(Pseudo code since you're not mentioning language)

Set keySet;

while(not end_of_input_file)
    read line from input file
    if first column is not in keySet
        add first column to keySet
        write line to output file
end while

Upvotes: 0

Removing duplicates from a huge file

Answers (5)

Related Questions