Angelo
Angelo

Reputation: 5059

removal of redundant lines based on value in last column

I am trying to remove redundant data from my tab delimited file, which looks like this

chr1    1841    1851    4
chr1    1991    2001    3
chr1    2491    2501    2
chr1    2491    2501    2
chr1    2501    2511    1
chr1    2681    2691    3
chr1    2881    2891    4
chr1    2891    2901    1
chr1    3241    3251    1
chr1    3241    3251    6

Condition is: if first three columns are same then the value in the fourth column (highest value) should be used to get the first three columns and also the fourth column. If there is a tie then only once the value in 4 columns should be printed.

So for the above input the ideal output should be

chr1    1841    1851    4
chr1    1991    2001    3
chr1    2491    2501    2
chr1    2501    2511    1
chr1    2681    2691    3
chr1    2881    2891    4
chr1    2891    2901    1
chr1    3241    3251    6

How I approached:

sorted the file first and then I messed it

sort file | awk -F '\t' 'NR==1{last = $1; max = 0} {if (last != $1) {printf "%s\t%e\n", last, max; last = $1; max = $4} else if (max < $4) max = $4} END{printf "%s\t%e\n", last, max}'

Kindly help

Upvotes: 1

Views: 227

Answers (4)

Thor
Thor

Reputation: 47169

You could do this by using the first three columns as a key into a hash, and only remember the one with the largest $4:

<infile awk '
  BEGIN { FS = OFS = "\t" }

  $4 > h[$1,$2,$3] { h[$1,$2,$3] = $4 }

  END { 
    for(k in h) { 
      split(k, a, SUBSEP)
      print a[1], a[2], a[3], h[k]
    }
  }
' | sort -k1 -k2n,3n

Output:

chr1    1841    1851    4
chr1    1991    2001    3
chr1    2491    2501    2
chr1    2501    2511    1
chr1    2681    2691    3
chr1    2881    2891    4
chr1    2891    2901    1
chr1    3241    3251    6

If you are using GNU awk you could also do the sorting from within awk:

parse.awk

BEGIN { FS = OFS = "\t" }

$4 > h[$1,$2,$3] { h[$1,$2,$3] = $4 }

END { 
  len = asorti(h, d)
  for(i=1; i<=len; i++) {
    flen = split(d[i], a, SUBSEP)
    for(j=1; j<=flen; j++)
      printf "%s%s", a[j], OFS
    print h[d[i]]
  }
}

And run it like this:

awk -f parse.awk infile

Upvotes: 1

Chris Seymour
Chris Seymour

Reputation: 85845

$ sort -k1 -k2,3n -k4nr file | awk '!a[$1,$2,$3]++'
chr1    1841    1851    4
chr1    1991    2001    3
chr1    2491    2501    2
chr1    2501    2511    1
chr1    2681    2691    3
chr1    2881    2891    4
chr1    2891    2901    1
chr1    3241    3251    6

Upvotes: 3

Kent
Kent

Reputation: 195169

this one-liner should give the output:

awk -F'\t' -v OFS="\t" '{t=$1FS$2FS$3;if(!(t in a)||a[t]<$4)a[t]=$4}END{for(x in a) print x,a[x]}' file|sort

clear format:

 awk -F'\t' -v OFS="\t" '{
    t=$1FS$2FS$3
    if(!(t in a)||a[t]<$4)
        a[t]=$4
 }
 END{for(x in a) print x,a[x]}' file|sort

if you run with your data file:

kent$  cat file
chr1    1841    1851    4
chr1    1991    2001    3
chr1    2491    2501    2
chr1    2491    2501    2
chr1    2501    2511    1
chr1    2681    2691    3
chr1    2881    2891    4
chr1    2891    2901    1
chr1    3241    3251    1
chr1    3241    3251    6

kent$  awk -F'\t' -v OFS="\t" '{t=$1FS$2FS$3;if(!(t in a)||a[t]<$4)a[t]=$4}END{for(x in a) print x,a[x]}' file|sort
chr1    1841    1851    4
chr1    1991    2001    3
chr1    2491    2501    2
chr1    2501    2511    1
chr1    2681    2691    3
chr1    2881    2891    4
chr1    2891    2901    1
chr1    3241    3251    6

Upvotes: 1

Barmar
Barmar

Reputation: 781626

sort -k 1,3 -k 4r file | awk 'last != $1" "$2" "$3 { print; last = $1" "$2" "$3; }'

My output:

chr1    1841    1851    4
chr1    1991    2001    3
chr1    2491    2501    2
chr1    2501    2511    1
chr1    2681    2691    3
chr1    2881    2891    4
chr1    2891    2901    1
chr1    3241    3251    6

http://ideone.com/nH6boE

Upvotes: 1

Related Questions