Reputation: 191
I have a ~20GB csv file. Sample file:
1,[email protected],M
2,[email protected],M
1,[email protected],F
3,[email protected],F
The primary key in this file is the first column. I need to write two files, uniq.csv and duplicates.csv.
uniq.csv should contain all non-duplicate records, and duplicates.csv should contain all duplicate records with the current timestamp.
uniq.csv
1,[email protected],M
2,[email protected],M
3,[email protected],F
duplicates.csv
2012-06-29 01:53:31 PM, 1,[email protected],F
I am using Unix sort so that I can take advantage of its external R-way merge sort algorithm.
To identify unique records:
tail -n+2 data.txt | sort -t, -k1 -un > uniq.csv
To identify duplicate records:
awk 'x[$1]++' FS="," data.txt | awk '{print d,$1}' "d=$(date +'%F %r')," > duplicates.csv
I was wondering if there is any way to find both the duplicates and the unique records with a single scan of this large file?
Upvotes: 2
Views: 4033
Reputation: 86
Your awk script is nearly there. To find the unique lines, you merely need to use the in operator to test whether the entry is in the associative array or not. This allows you to collect the data in one pass through the data file and avoids having to call sort.
tail -n +2 data.txt | \
awk '
BEGIN { OFS = FS = "," }
{
    # First occurrence of this key: write the record to fd 3 (uniq.csv)
    if (!($1 in x)) {
        print $0 > "/dev/fd/3"
    }
    x[$1]++
}
END {
    # Write the timestamp, key, and occurrence count for every key seen
    for (t in x) {
        print d, t, x[t]
    }
}' d="$(date +'%F %r')" 3> uniq.csv > duplicates.csv
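Note that the 3> uniq.csv redirection opens file descriptor 3 for the awk process, so print $0 > "/dev/fd/3" sends the first occurrence of each key to uniq.csv while the regular output from the END block goes to duplicates.csv. This relies on the system providing /dev/fd, which Linux and most BSDs do.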
Upvotes: 2
Reputation: 7745
Here is some Perl code which will do the processing in one scan:
#!/usr/bin/perl
# Read the file sorted on the first (key) column so duplicate keys are adjacent
open(FI, "sort -t, -k1 < file.txt |");
open(FD, ">duplicates.txt");
open(FU, ">uniques.txt");
my @prev;
while (<FI>)
{
    my (@cur) = split(',');
    if ($prev[0] && $prev[0] == $cur[0])
    {
        # Same key as the previous record: duplicate
        print FD localtime() . " $_";
    }
    else
    {
        print FU $_;
    }
    @prev = @cur;
}
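Because the input is piped through sort on the key column, duplicate keys arrive on adjacent lines, so comparing each record's key with the previous one is enough to classify it; the trade-off is that you still pay for the external sort before the single pass.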
Upvotes: 0
Reputation: 135
I got this question in an interview, a couple of jobs ago.
One answer is to use uniq with the "-c" (count) option. An entry with a count of "1" is unique, and otherwise not unique.
sort foo | uniq -c | awk '{ if ($1 == 1) { print > "unique" } else { print > "duplicate" } }'
If you want to write a special-purpose program and/or avoid the delay caused by sorting, I would use Python.
Read the input file, hashing each entry and incrementing an integer count for each key that you encounter. Remember that hash values can collide even when the two items are not equal, so keep each key individually along with its count. At EOF on the input, traverse the hash structure and write each entry into one of the two files.
You seem not to need sorted output, only categorized output, so hashing should be faster. Each hash insertion is O(1), roughly O(N) for the whole file, while a comparison sort such as Unix sort is O(N log N).
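As a rough illustration of that idea (not the interviewer's actual answer), here is a minimal Python sketch: it counts keys in a dictionary in one pass, keeps the first record seen for each key, and then writes each representative record to one of two files. The file names data.csv, uniq.csv, and duplicates.csv and the timestamp format are assumptions borrowed from the question, and keeping one record per key in memory assumes the set of distinct keys fits in RAM.
#!/usr/bin/env python
# Minimal sketch of the hashing approach described above; file names and
# timestamp format are assumptions, not part of the original answer.
import csv
from datetime import datetime

counts = {}  # key -> number of times the key was seen
first = {}   # key -> first full record seen for that key

with open("data.csv", newline="") as src:
    for row in csv.reader(src):
        if not row:
            continue
        key = row[0]
        counts[key] = counts.get(key, 0) + 1
        if key not in first:
            first[key] = row

stamp = datetime.now().strftime("%Y-%m-%d %I:%M:%S %p")
with open("uniq.csv", "w", newline="") as u, \
     open("duplicates.csv", "w", newline="") as d:
    unique_out, dup_out = csv.writer(u), csv.writer(d)
    for key, row in first.items():
        if counts[key] == 1:
            unique_out.writerow(row)           # key seen exactly once
        else:
            dup_out.writerow([stamp] + row)    # key seen more than once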
Upvotes: 0