AishwaryaKulkarni

Reputation: 784

Removing duplicate lines with different columns

I have a file which looks like follows:

 ENSG00000197111:I12 0
 ENSG00000197111:I12 1
 ENSG00000197111:I13 0
 ENSG00000197111:I18 0
 ENSG00000197111:I2 0
 ENSG00000197111:I3 0
 ENSG00000197111:I4 0
 ENSG00000197111:I5 0
 ENSG00000197111:I5 1

Some lines are duplicated in the first column, but I cannot remove them with sort -u because the second column differs between them (1 or 0). How do I remove such duplicates, keeping the line whose second column is 1, so that the file becomes:

 ENSG00000197111:I12 1
 ENSG00000197111:I13 0
 ENSG00000197111:I18 0
 ENSG00000197111:I2 0
 ENSG00000197111:I3 0
 ENSG00000197111:I4 0
 ENSG00000197111:I5 1

Upvotes: 0

Views: 45

Answers (1)

Jose Ricardo Bustos M.

Reputation: 8164

You can use awk with the || (OR) operator, if the output order isn't mandatory:

awk '{d[$1]=d[$1] || $2}END{for(k in d) print k, d[k]}' file

You get:

ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1
ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
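If you do need to preserve the original line order, a two-pass variant of the same awk idea works: the first pass ORs together column 2 per key, and the second pass prints each key once, in input order. This is a sketch building on the answer above, not part of the original answer:

```shell
# Sample input, using names from the question.
printf '%s\n' \
  'ENSG00000197111:I12 0' \
  'ENSG00000197111:I12 1' \
  'ENSG00000197111:I5 0' \
  'ENSG00000197111:I5 1' > file

# Pass 1 (NR==FNR): remember the OR of column 2 for each key.
# Pass 2: print each key the first time it is seen, in input order.
awk 'NR==FNR {d[$1] = d[$1] || $2; next}
     !seen[$1]++ {print $1, d[$1]}' file file
```

Reading the same file twice costs a second pass over the data, but avoids the arbitrary `for (k in d)` traversal order.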

Edit: a sort-only solution

You can use sort in a double pass. The first pass sorts by key and then by the second column in reverse, so the "1" line comes first within each key; the second pass keeps one line per key:

sort -k1,1 -k2,2r file | sort -u -k1,1

You get:

ENSG00000197111:I12 1
ENSG00000197111:I13 0
ENSG00000197111:I18 0
ENSG00000197111:I2 0
ENSG00000197111:I3 0
ENSG00000197111:I4 0
ENSG00000197111:I5 1
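One caveat on the second pass: which of the equal-key lines `-u` keeps depends on the sort being stable. GNU sort behaves this way in practice, but adding `-s` makes it explicit. This is a defensive tweak, not part of the original answer:

```shell
# Sample input, using names from the question.
printf '%s\n' \
  'ENSG00000197111:I5 0' \
  'ENSG00000197111:I5 1' \
  'ENSG00000197111:I2 0' > file

# Pass 1: sort by key, then column 2 descending, so "1" lines come first.
# Pass 2: -s keeps the first line of each equal-key run; -u drops the rest.
sort -k1,1 -k2,2r file | sort -s -u -k1,1
```

Per POSIX, which duplicate `-u` retains is unspecified, so `-s` documents the intent even where GNU sort already does the right thing.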

Upvotes: 1
