KarthikRajagopal

Reputation: 191

Remove duplicates from a large file

I have a ~20GB csv file. Sample file:

1,[email protected],M
2,[email protected],M
1,[email protected],F
3,[email protected],F

The primary key in this file is the first column. I need to write two files, uniq.csv and duplicates.csv.

uniq.csv should contain all non-duplicate records, and duplicates.csv should contain all duplicate records with the current timestamp.

uniq.csv

1,[email protected],M
2,[email protected],M
3,[email protected],F

duplicates.csv

2012-06-29 01:53:31 PM, 1,[email protected],F

I am using Unix sort so that I can take advantage of its external R-way merge sort algorithm.

To identify unique records:
tail -n+2 data.txt | sort -t, -k1 -un > uniq.csv

To identify duplicate records:
awk 'x[$1]++' FS="," data.txt | awk '{print d,$1}' "d=$(date +'%F %r')," > duplicates.csv

I was wondering if there is any way to find both the duplicates and the unique records in a single scan of this large file?

Upvotes: 2

Views: 4033

Answers (3)

Jules Reid

Reputation: 86

Your awk script is nearly there. To find the unique lines, you merely need to use the in operator to test whether the entry is in the associative array or not. This allows you to collect the data in one pass through the data file and to avoid having to call sort.

tail -n +2 data.txt | \
awk '
    BEGIN { OFS=FS="," }
    {
        # First occurrence of this key: write the full record to fd 3 (uniq.csv)
        if (!($1 in x)) {
            print $0 > "/dev/fd/3"
        }
        x[$1]++
    }
    END {
        # Report only keys seen more than once: timestamp, key, occurrence count
        for (t in x) {
            if (x[t] > 1) {
                print d, t, x[t]
            }
        }
    }' d="$(date +'%F %r')" 3> uniq.csv > duplicates.csv

Upvotes: 2

amaksr

Reputation: 7745

Here is some Perl code that will do the processing in one scan:

#!/usr/bin/perl
# Sort on the first (key) column, then make a single pass over the sorted stream:
# the first record for each key goes to uniques.txt, any later record with the
# same key goes to duplicates.txt with a timestamp prepended.
open(FI,"sort -t, -k1 < file.txt |");
open(FD,">duplicates.txt");
open(FU,">uniques.txt");
my @prev;
while(<FI>)
{
    my (@cur) = split(',');
    if($prev[0] && $prev[0]==$cur[0])
    {
        # Same key as the previous line: it is a duplicate
        print FD localtime()." $_";
    }
    else
    {
        print FU $_;
    }
    @prev=@cur;
}

Upvotes: 0

Bill Torcaso

Reputation: 135

I got this question in an interview, a couple jobs ago.

One answer is to use uniq with the "-c" (count) option. An entry with a count of "1" is unique; otherwise it is not.

sort foo | uniq -c | awk '{ if ($1 == 1) { print > "uniq.csv" } else { print > "duplicates.csv" } }'

If you want to write a special-purpose program and/or avoid the delay caused by sorting, I would use Python.

Read the input file, hashing each entry and incrementing an integer count for each unique key that you encounter. Remember that hash values can collide even when the two items are not equal, so keep each key individually along with its count. At EOF on the input, traverse the hash structure and write each entry into one of two files.

You seem not to need sorted output, only categorized output, so hashing should be faster. Hash insertion and lookup are O(1) per record, while a comparison sort such as Unix sort is O(N log N).
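
For reference, here is a minimal single-pass sketch of that idea in Python. It assumes, as in the question's example, that the first occurrence of each key is the record that belongs in uniq.csv, and it reuses the question's file names; only the keys are held in memory, not the full records.

#!/usr/bin/env python3
# Single-pass split: first occurrence of a key -> uniq.csv,
# later occurrences -> duplicates.csv, prefixed with a timestamp
# (format mirrors the question's date +'%F %r' output).
from datetime import datetime

seen = set()                                 # keys (first column) seen so far
stamp = datetime.now().strftime('%Y-%m-%d %I:%M:%S %p')

with open('data.txt') as src, \
     open('uniq.csv', 'w') as uniq, \
     open('duplicates.csv', 'w') as dups:
    for line in src:
        key = line.split(',', 1)[0]
        if key in seen:
            dups.write(stamp + ', ' + line)  # duplicate: keep the record, add timestamp
        else:
            seen.add(key)
            uniq.write(line)                 # first record for this key

Whether this beats sorting depends on the number of distinct keys: the set holds one entry per key, which for a 20 GB file of numeric IDs is far smaller than the file itself, but it still has to fit in RAM.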

Upvotes: 0
