Chubaka

Reputation: 3135

awk: totally separate duplicates and non-duplicates

If we have an input:

TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N

Now we would like to separate the duplicates from the non-duplicates based on the fourth column (SMILES):

duplicates:

95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1 

non-duplicates:

95,CPD-3333333,-1,c1ccccc1N

The following attempt separates the duplicates without any problem; however, the first occurrence of each duplicate is still written to the non-duplicate file.

BEGIN { FS = ","; f1 = "a"; f2 = "b" }

{
    # Keep a count of the values in the fourth column
    count[$4]++;

    # Save the line the first time we encounter a unique field
    if (count[$4] == 1)
        first[$4] = $0;

    # If we encounter the field for the second time, print the
    # previously saved line
    if (count[$4] == 2)
        print first[$4] > f1;

    # From the second time onward, always print because the field is
    # duplicated
    if (count[$4] > 1)
        print > f1;

    if (count[$4] == 1)      # if (count[$4] - count[$4] == 0)  <= changing to this doesn't work
        print first[$4] > f2;
}

Duplicate output from the attempt:

95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1

Non-duplicate output from the attempt:

TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
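For completeness, a self-contained reproduction of the attempt (sample data and file names `a` and `b` as above). The first `c1ccccc1` row lands in `b` because, in a single pass, the script cannot yet know that value will repeat when it first sees it.

```shell
# Recreate the sample input from the question.
cat > input <<'EOF'
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
EOF

# The attempt from the question, condensed into one action block.
awk '
BEGIN { FS = ","; f1 = "a"; f2 = "b" }
{
    count[$4]++
    if (count[$4] == 1) first[$4] = $0    # remember first occurrence
    if (count[$4] == 2) print first[$4] > f1
    if (count[$4] > 1)  print > f1
    if (count[$4] == 1) print first[$4] > f2   # too early to know it repeats
}
' input

cat a   # the three c1ccccc1 rows, as desired
cat b   # header, the non-duplicate, AND the stray first c1ccccc1 row
```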

May I know if any guru might have comments/solutions? Thanks.

Upvotes: 0

Views: 94

Answers (4)

glenn jackman

Reputation: 246807

I would do this:

awk '
    NR==FNR {count[$2] = $1; next} 
    FNR==1  {FS=","; next} 
    {
        output = (count[$NF] == 1 ? "nondup" : "dup")
        print > output
    }
' <(cut -d, -f4 input | sort | uniq -c) input

The process substitution pre-processes the file, producing a count for each distinct value of the 4th column. The main pass then reads those counts first and uses them to decide whether each line is duplicated.
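To see what the first pass consumes, here is the intermediate `count value` output of the `cut | sort | uniq -c` pipeline on the sample input (file name `input` as in the answer; exact line order may vary by locale):

```shell
# Recreate the sample input from the question.
cat > input <<'EOF'
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
EOF

# Counts per distinct 4th-column value: 3 for c1ccccc1,
# 1 each for c1ccccc1N and the SMILES header.
cut -d, -f4 input | sort | uniq -c
```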


All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a two-pass solution that's virtually identical to my example above:

awk -F, '
    NR==FNR {count[$NF]++; next} 
    FNR==1  {next} 
    {
        output = (count[$NF] == 1 ? "nondup" : "dup")
        print > output
    }
'  input  input

Yes, the input file is given twice.
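Running the two-pass version end to end on the sample input (file names `input`, `dup`, and `nondup` as in the answer) gives the split the question asks for:

```shell
# Recreate the sample input from the question.
cat > input <<'EOF'
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
EOF

# Pass 1 (NR==FNR) counts 4th-column values; pass 2 routes each
# data line to "dup" or "nondup" based on its count.
awk -F, '
    NR==FNR {count[$NF]++; next}
    FNR==1  {next}
    {
        output = (count[$NF] == 1 ? "nondup" : "dup")
        print > output
    }
' input input

cat dup      # the three c1ccccc1 rows
cat nondup   # 95,CPD-3333333,-1,c1ccccc1N
```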

Upvotes: 4

user3442743

Reputation:

A little late. My version in awk:

awk -F, 'NR>1{a[$0":"$4];b[$4]++}
        END{d="\n\nnondupe";e="dupe"
        for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"i:e=e"\n"i} print e d}' file

Another one, built similarly to glenn jackman's, but all in awk:

awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file
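A sketch of running the `getline` variant on the sample input (file names `file`, `dup`, `nondup` as above). One difference worth noting: unlike glenn jackman's version, this one has no header-skip, so the header row (count 1 for `SMILES`) ends up in `nondup`:

```shell
# Recreate the sample input from the question.
cat > file <<'EOF'
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
EOF

# r() pre-reads the file via getline in BEGIN to count column 4;
# the normal pass then routes each line (header included) by count.
awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output}' file

cat dup      # three c1ccccc1 rows
cat nondup   # header row plus 95,CPD-3333333,-1,c1ccccc1N
```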

Upvotes: 0

Ed Morton

Reputation: 203522

$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
    if (cnt[$4]++) {
        dups[$4] = nonDups[$4] dups[$4] $0 ORS
        delete nonDups[$4]
    }
    else {
        nonDups[$4] = $0 ORS
    }
}
END {
    print "Duplicates:"
    for (key in dups) {
        printf "%s", dups[key]
    }

    print "\nNon Duplicates:"
    for (key in nonDups) {
        printf "%s", nonDups[key]
    }
}

$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1

Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N

Upvotes: 2

ooga

Reputation: 15501

This solution only works if the duplicates are grouped together.

awk -F, '
  function fout(    f, i) {
    f = (cnt > 1) ? "dups" : "nondups"
    for (i = 1; i <= cnt; ++i)
      print lines[i] > f
  }
  NR == 1 { next }   # skip the header so it does not land in "nondups"
  $4 != lastkey { fout(); cnt = 0 }
  { lastkey = $4; lines[++cnt] = $0 }
  END { fout() }
' file
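If the input is not already grouped on column 4, one hypothetical way to arrange it before feeding it to the script above is to keep the header line in place and sort only the data rows (the `sorted` file name here is illustrative):

```shell
# Recreate the sample input, deliberately with the duplicates NOT adjacent.
cat > file <<'EOF'
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
EOF

# Keep line 1 (header) first, sort the rest on the 4th comma field
# so duplicate SMILES become adjacent.
{ head -n 1 file; tail -n +2 file | sort -t, -k4,4; } > sorted
cat sorted
```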

Upvotes: 1
