Reputation: 5059
I am trying to remove redundant data from my tab delimited file, which looks like this
chr1 1841 1851 4
chr1 1991 2001 3
chr1 2491 2501 2
chr1 2491 2501 2
chr1 2501 2511 1
chr1 2681 2691 3
chr1 2881 2891 4
chr1 2891 2901 1
chr1 3241 3251 1
chr1 3241 3251 6
Condition is: if first three columns are same then the value in the fourth column (highest value) should be used to get the first three columns and also the fourth column. If there is a tie then only once the value in 4 columns should be printed.
So for the above input the ideal output should be
chr1 1841 1851 4
chr1 1991 2001 3
chr1 2491 2501 2
chr1 2501 2511 1
chr1 2681 2691 3
chr1 2881 2891 4
chr1 2891 2901 1
chr1 3241 3251 6
How I approached:
sorted the file first and then I messed it
sort file | awk -F '\t' 'NR==1{last = $1; max = 0} {if (last != $1) {printf "%s\t%e\n", last, max; last = $1; max = $4} else if (max < $4) max = $4} END{printf "%s\t%e\n", last, max}'
Kindly help
Upvotes: 1
Views: 227
Reputation: 47169
You could do this by using the first three columns as a key into a hash, and only remember the one with the largest $4
:
<infile awk '
BEGIN { FS = OFS = "\t" }
$4 > h[$1,$2,$3] { h[$1,$2,$3] = $4 }
END {
for(k in h) {
split(k, a, SUBSEP)
print a[1], a[2], a[3], h[k]
}
}
' | sort -k1 -k2n,3n
Output:
chr1 1841 1851 4
chr1 1991 2001 3
chr1 2491 2501 2
chr1 2501 2511 1
chr1 2681 2691 3
chr1 2881 2891 4
chr1 2891 2901 1
chr1 3241 3251 6
If you are using GNU awk you could also do the sorting from within awk:
parse.awk
BEGIN { FS = OFS = "\t" }
$4 > h[$1,$2,$3] { h[$1,$2,$3] = $4 }
END {
len = asorti(h, d)
for(i=1; i<=len; i++) {
flen = split(d[i], a, SUBSEP)
for(j=1; j<=flen; j++)
printf "%s%s", a[j], OFS
print h[d[i]]
}
}
And run it like this:
awk -f parse.awk infile
Upvotes: 1
Reputation: 85845
$ sort -k1 -k2,3n -k4nr file | awk '!a[$1,$2,$3]++'
chr1 1841 1851 4
chr1 1991 2001 3
chr1 2491 2501 2
chr1 2501 2511 1
chr1 2681 2691 3
chr1 2881 2891 4
chr1 2891 2901 1
chr1 3241 3251 6
Upvotes: 3
Reputation: 195169
this one-liner should give the output:
awk -F'\t' -v OFS="\t" '{t=$1FS$2FS$3;if(!(t in a)||a[t]<$4)a[t]=$4}END{for(x in a) print x,a[x]}' file|sort
clear format:
awk -F'\t' -v OFS="\t" '{
t=$1FS$2FS$3
if(!(t in a)||a[t]<$4)
a[t]=$4
}
END{for(x in a) print x,a[x]}' file|sort
if you run with your data file:
kent$ cat file
chr1 1841 1851 4
chr1 1991 2001 3
chr1 2491 2501 2
chr1 2491 2501 2
chr1 2501 2511 1
chr1 2681 2691 3
chr1 2881 2891 4
chr1 2891 2901 1
chr1 3241 3251 1
chr1 3241 3251 6
kent$ awk -F'\t' -v OFS="\t" '{t=$1FS$2FS$3;if(!(t in a)||a[t]<$4)a[t]=$4}END{for(x in a) print x,a[x]}' file|sort
chr1 1841 1851 4
chr1 1991 2001 3
chr1 2491 2501 2
chr1 2501 2511 1
chr1 2681 2691 3
chr1 2881 2891 4
chr1 2891 2901 1
chr1 3241 3251 6
Upvotes: 1
Reputation: 781626
sort -k 1,3 -k 4r file | awk 'last != $1" "$2" "$3 { print; last = $1" "$2" "$3; }'
My output:
chr1 1841 1851 4
chr1 1991 2001 3
chr1 2491 2501 2
chr1 2501 2511 1
chr1 2681 2691 3
chr1 2881 2891 4
chr1 2891 2901 1
chr1 3241 3251 6
Upvotes: 1