mirix

Reputation: 523

Awk: Remove duplicate lines with conditions

I have a tab-delimited text file with 8 columns:

Erythropoietin Receptor         Integrin Beta 4                       11.7  9.7  164  195  19  3.2
Erythropoietin Receptor         Receptor Tyrosine Phosphatase F       10.8  2.6   97  107  15  3.2
Erythropoietin Receptor         Leukemia Inhibitory Factor Receptor   12.0  3.6  171  479  14  3.2
Erythropoietin Receptor         Immunoglobulin 9                      10.4  3.1  100  108  24  3.3
Erythropoietin Receptor         Collagen Alpha 1 Xx                   10.7  2.7   93  105  18  3.3
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 5      11.4  3.2  114  114  25  1.7
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 14     11.1  2.1   99  100  28  1.8
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 1B     10.9  4.9  133  162  29  1.9
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 11A    11.5  5.1  130  166  25  1.9

The first and second columns contain protein names, and the 8th column contains the "distance" score for each protein pair. I would like to remove the lines containing duplicate protein pairs and keep, for each pair, only the line with the lowest distance (the lowest value in the 8th column). That is, for the pair Protein A-Protein B, I want to remove all occurrences except the one with the lowest distance score. A pair counts as a duplicate even if the protein names are swapped between the two columns: Protein A, Protein B is the same pair as Protein B, Protein A.

Upvotes: 1

Views: 777

Answers (2)

Kent

Reputation: 195029

I hope this would be the final update ^_^

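awk -F'\t' '{
        # a[] holds the lowest score seen per pair, r[] the matching line.
        # Keys join the two names with FS so that distinct pairs cannot collide.
        if (($1 FS $2) in a) {
                if ($8+0 < a[$1 FS $2]) {
                        a[$1 FS $2] = $8+0; r[$1 FS $2] = $0
                }
        } else if (($2 FS $1) in a) {        # same pair, names swapped
                if ($8+0 < a[$2 FS $1]) {
                        a[$2 FS $1] = $8+0; r[$2 FS $1] = $0
                }
        } else {                             # first occurrence of this pair
                a[$1 FS $2] = $8+0; r[$1 FS $2] = $0
        }
} END { for (x in r) print r[x] }' yourFile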
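
To check the swapped-name handling, the same idea can be run on a tiny made-up input. This is only a sketch under stated assumptions, not the command above verbatim: the function name `dedupe_pairs` and the three sample rows are hypothetical, and instead of testing both name orders it builds one canonical key per pair by putting the two names in string order. The output goes through `sort` because `for (k in line)` visits array keys in an unspecified order.

```shell
# Hypothetical helper: for each unordered pair of names in columns 1-2 of
# tab-delimited input, keep the line with the lowest value in column 8.
dedupe_pairs() {
    awk -F'\t' '{
        # Canonical key: the two names in string order, so A-B equals B-A
        k = ($1 "" < $2 "") ? ($1 FS $2) : ($2 FS $1)
        # Store the line if the pair is new or this score is lower
        if (!(k in best) || $8 + 0 < best[k]) { best[k] = $8 + 0; line[k] = $0 }
    } END { for (k in line) print line[k] }' "$@" | sort
}

# Made-up rows: the second repeats the first pair with the names swapped
printf 'A\tB\t1\t1\t1\t1\t1\t3.2\nB\tA\t1\t1\t1\t1\t1\t1.5\nA\tC\t1\t1\t1\t1\t1\t2.0\n' |
    dedupe_pairs
```

Of the two A/B rows, only the one scoring 1.5 should survive, alongside the single A/C row. With a real file, `dedupe_pairs yourFile` reads it directly.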

Upvotes: 1

Dimitre Radoulov

Reputation: 27990

Something like this (untested):

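awk -F'\t' 'END {
  for (r in rec) print rec[r]
  }
{
  # Order the two names so that A-B and B-A share one key
  k = ($1 "" < $2 "") ? ($1 SUBSEP $2) : ($2 SUBSEP $1)
  # Keep the line if the pair is new or its distance is lower than the stored one
  if (!(k in min) || $NF + 0 < min[k]) {
    min[k] = $NF + 0
    rec[k] = $0
    }
  }' infile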
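
A different route, if sorted output is acceptable, is to sort by the score first and then keep the first line seen for each pair. The sketch below swaps in that technique; the function name `dedupe_by_sort` and the sample rows are made up for illustration. It assumes the score is in column 8 and uses a period as the decimal separator, hence `LC_ALL=C`.

```shell
# Hypothetical helper: sort tab-delimited input by column 8 (ascending,
# numeric), then print only the first line seen for each unordered pair
# of names in columns 1-2.
dedupe_by_sort() {
    LC_ALL=C sort -t "$(printf '\t')" -k8,8n "$@" |
    awk -F'\t' '{
        # Canonical key: the two names in string order, so A-B equals B-A
        k = ($1 "" < $2 "") ? ($1 FS $2) : ($2 FS $1)
        # After sorting, the first line for a pair has its lowest score
        if (!(k in seen)) { seen[k] = 1; print }
    }'
}

# Made-up rows: the second repeats the first pair with the names swapped
printf 'A\tB\t1\t1\t1\t1\t1\t3.2\nB\tA\t1\t1\t1\t1\t1\t1.5\nA\tC\t1\t1\t1\t1\t1\t2.0\n' |
    dedupe_by_sort
```

Here the surviving lines come out ordered by score rather than in the arbitrary order of an awk `for (... in ...)` loop, at the cost of sorting the whole file.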

Upvotes: 3
