Philippe Fisher

Reputation: 596

Get list of all duplicates based on first column within large text/csv file in linux/ubuntu

I am trying to extract all the duplicates based on the first column/index of my very large text/csv file (7+ GB / 100+ million lines). The format is like so:

foo0:bar0
foo1:bar1
foo2:bar2

The first column is any lowercase UTF-8 string and the second column is any UTF-8 string. I have been able to sort my file based on the first column, and only the first column, with:

sort -t':' -k1,1 filename.txt > output_sorted.txt

I have also been able to drop all duplicates with:

sort -t':' -u -k1,1 filename.txt > output_uniq_sorted.txt

These operations take 4-8 min.

I am now trying to extract all duplicates based on the first column, and only the first column, so that I can check whether all the entries in the second column match.

I think I can achieve this with the following awk script:

BEGIN { FS = ":" }
{
    # count how many times this first-column key has been seen
    count[$1]++;

    # remember the first line seen for each key
    if (count[$1] == 1){
        first[$1] = $0;
    }

    # on the second occurrence, emit the stored first line
    if (count[$1] == 2){
        print first[$1];
    }

    # emit every occurrence after the first (including the second)
    if (count[$1] > 1){
        print $0;
    }
}

Running it with:

awk -f awk.dups input_sorted.txt > output_dup.txt
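As a quick sanity check, this is the expected behavior on a tiny sorted sample (the sample data is mine, not from the original file):

$ printf 'apple:one\napple:two\nbanana:three\ncherry:four\ncherry:five\ncherry:six\n' | awk -f awk.dups
apple:one
apple:two
cherry:four
cherry:five
cherry:six

banana:three is dropped because its key occurs only once; every line belonging to a repeated key is kept.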

Now the problem is that this takes far too long (3+ hours, and it is not yet done). I know uniq can get all duplicates with something like:

uniq -D sorted_file.txt > output_dup.txt

The problem is specifying the delimiter and comparing only the first column. I know uniq has a -f N option to skip the first N fields, but its fields are whitespace-separated and it compares the rest of the line, not just one column. Is there a way to get these results without having to change/process my data? Is there another tool that could accomplish this? I have already tried Python + pandas with read_csv to get the duplicates, but this leads to errors (segmentation fault), and it is not efficient anyway: since the data is sorted, I shouldn't have to load all of it into memory. I have decent hardware.
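For reference, a minimal demonstration of the uniq -D behavior described above (GNU uniq; the sample data is mine):

$ printf 'foo0:bar0\nfoo0:bar0\nfoo0:bar1\n' | uniq -D
foo0:bar0
foo0:bar0

foo0:bar1 is not reported even though its first column repeats, because uniq compares whole lines; and -f N skips whitespace-separated fields, so it cannot isolate a ':'-delimited first column.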

Anything that can help is welcome. Thanks.

SOLUTION FROM BELOW

Using:

awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c'

Timing it with the time command, I get the following performance:

real    0m46.058s
user    0m40.352s
sys     0m2.984s

Upvotes: 1

Views: 516

Answers (2)

Philippe Fisher

Reputation: 596

I have changed the awk script slightly because I couldn't fully understand what was happening in karakfa's answer below.

awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c>=1{if(c==1){print p0;} print $0}' sorted.input > duplicate.entries

I have tested this and it produces the same output as the above, but it might be easier to understand.

{if(p!=$1){p=$1; c=0; p0=$0} else c++}

If the first field of the line is not the same as the previous one, we save the first field in p, set c to 0, and save the whole line into p0. If it is the same, we increment c.

c>=1{if(c==1){print p0;} print $0}

In the case of a repeat, we check whether it is the first repeat. If it is, we print both the saved line and the current line; if not, we just print the current line.
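A quick check on a small sorted sample (the data is mine) shows the duplicate groups being emitted as expected:

$ printf 'apple:one\napple:two\nbanana:three\ncherry:four\ncherry:five\n' | awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c>=1{if(c==1){print p0;} print $0}'
apple:one
apple:two
cherry:four
cherry:five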

Upvotes: 0

karakfa

Reputation: 67567

If your file is already sorted, you don't need to store more than one line. Try this:

$ awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c' sorted.input
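For example, on a small sorted sample (the data is mine):

$ printf 'apple:one\napple:two\nbanana:three\ncherry:four\ncherry:five\ncherry:six\n' | awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c'
apple:one
apple:two
cherry:four
cherry:five
cherry:six

The trailing bare c is a pattern with the default action (print the current line), so it fires for every repeated line, while c==1{print p0} emits the stored first line of each group exactly once.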

If you try this, please post the timings...

Upvotes: 1
