Philippe Fisher

Reputation: 596

Get list of all duplicates based on first column within large text/csv file in linux/ubuntu

I am trying to extract all the duplicates based on the first column/index of my very large text/csv file (7+ GB / 100+ million lines). The format is like so:

foo0:bar0
foo1:bar1
foo2:bar2

The first column is any lowercase UTF-8 string and the second column is any UTF-8 string. I have been able to sort my file based on the first column, and only the first column, with:

sort -t':' -k1,1 filename.txt > output_sorted.txt

I have also been able to drop all duplicates with:

sort -t':' -u -k1,1 filename.txt > output_uniq_sorted.txt

These operations take 4-8 min.

I am now trying to extract all duplicates based on the first column, and only the first column, so that I can check whether all the entries in the second column match.

I think I can achieve this with the following awk script:

BEGIN { FS = ":" }
{
    # count how many times this first-column key has been seen
    count[$1]++;

    # remember the first line seen for each key
    if (count[$1] == 1){
        first[$1] = $0;
    }

    # on the second occurrence, emit the stored first line
    if (count[$1] == 2){
        print first[$1];
    }

    # emit every occurrence after the first (including the second)
    if (count[$1] > 1){
        print $0;
    }
}

Running it with:

awk -f awk.dups input_sorted.txt > output_dup.txt
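As a quick sanity check, this is the expected behavior on a tiny sorted sample (the sample data is mine, not from the original file):

$ printf 'apple:one\napple:two\nbanana:three\ncherry:four\ncherry:five\ncherry:six\n' | awk -f awk.dups
apple:one
apple:two
cherry:four
cherry:five
cherry:six

banana:three is dropped because its key occurs only once; every line belonging to a repeated key is kept.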

Now the problem is that this takes far too long (3+ hours, and it is not yet done). I know uniq can get all duplicates with something like:

uniq -D sorted_file.txt > output_dup.txt

The problem is specifying the delimiter and comparing only the first column. I know uniq has a -f N option to skip the first N fields, but its fields are whitespace-separated and it compares the rest of the line, not just one column. Is there a way to get these results without having to change/process my data? Is there another tool that could accomplish this? I have already tried Python + pandas with read_csv to get the duplicates, but this leads to errors (segmentation fault), and it is not efficient anyway: since the data is sorted, I shouldn't have to load all of it into memory. I have decent hardware.
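For reference, a minimal demonstration of the uniq -D behavior described above (GNU uniq; the sample data is mine):

$ printf 'foo0:bar0\nfoo0:bar0\nfoo0:bar1\n' | uniq -D
foo0:bar0
foo0:bar0

foo0:bar1 is not reported even though its first column repeats, because uniq compares whole lines; and -f N skips whitespace-separated fields, so it cannot isolate a ':'-delimited first column.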

Anything that can help is welcome. Thanks.

SOLUTION FROM BELOW

Using:

awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c'

Timing it with the time command, I get the following performance:

real    0m46.058s
user    0m40.352s
sys     0m2.984s

Upvotes: 1

Views: 516

Answers (2)

Philippe Fisher

Reputation: 596

I have changed the awk script slightly because I couldn't fully understand what was happening in karakfa's answer below.

awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c>=1{if(c==1){print p0;} print $0}' sorted.input > duplicate.entries

I have tested this and it produces the same output as the above, but it might be easier to understand.

{if(p!=$1){p=$1; c=0; p0=$0} else c++}

If the first field of the line is not the same as the previous one, we save the first field in p, set c to 0, and save the whole line into p0. If it is the same, we increment c.

c>=1{if(c==1){print p0;} print $0}

In the case of a repeat, we check whether it is the first repeat. If it is, we print both the saved line and the current line; if not, we just print the current line.
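A quick check on a small sorted sample (the data is mine) shows the duplicate groups being emitted as expected:

$ printf 'apple:one\napple:two\nbanana:three\ncherry:four\ncherry:five\n' | awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c>=1{if(c==1){print p0;} print $0}'
apple:one
apple:two
cherry:four
cherry:five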

Upvotes: 0

karakfa

Reputation: 67567

If your file is already sorted, you don't need to store more than one line. Try this:

$ awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c' sorted.input
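For example, on a small sorted sample (the data is mine):

$ printf 'apple:one\napple:two\nbanana:three\ncherry:four\ncherry:five\ncherry:six\n' | awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c'
apple:one
apple:two
cherry:four
cherry:five
cherry:six

The trailing bare c is a pattern with the default action (print the current line), so it fires for every repeated line, while c==1{print p0} emits the stored first line of each group exactly once.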

If you try this, please post the timings...

Upvotes: 1
