Reputation: 28329
My data table looks like this:
chr4 124097568 124098568 337
chr4 159597106 159598106 1000
chr4 159597106 159598106 1000
chr4 164361532 164362532 455
chr4 164361532 164362532 74
chr4 164361532 164362532 2
chr4 170360150 170361150 0
I want to extract unique rows: if the values in columns 2 and 3 are the same, only the row with the highest value in column 4 should be kept. If columns 2, 3, and 4 are all identical, just one of those rows should be kept.
The preferred output is:
chr4 124097568 124098568 337
chr4 159597106 159598106 1000
chr4 164361532 164362532 455
chr4 170360150 170361150 0
If something is not clear, I'll try to explain it further (because I really need to solve this problem now).
Upvotes: 0
Views: 98
Reputation: 246774
awk '
{key = $2 SUBSEP $3}
!(key in max) || $4 > max[key] {max[key]=$4; line[key]=$0}
END {for (key in line) print line[key]}
'
Upvotes: 1
Reputation: 3194
One possible solution is to sort the lines of your input and then drop the lines that occur more than once. In Python, you could do something like:
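# Read the table line by line (a bare split() would break it into single fields).
with open("table.dat", "r") as f:
    lines = f.read().splitlines()
# Sorting puts identical lines next to each other.
lines.sort()
old = lines[0]
singles = [old]
for line in lines:
    # Keep a line only when it differs from the previously kept one.
    if old != line:
        singles.append(line)
        old = line
# Note: this drops exact duplicates only; it does not pick the highest column 4 value.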
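Dropping exact duplicates does not yet cover the case where columns 2 and 3 match but column 4 differs. A minimal sketch of one way to handle that in Python, keeping the largest column 4 value per (column 2, column 3) pair, could look like the following. The function name `best_rows` and the inline sample data are just illustrative assumptions, and column 4 is assumed to be numeric:

```python
def best_rows(lines):
    """For each (column 2, column 3) pair, keep the row with the largest column 4."""
    best = {}  # (col2, col3) -> (numeric col4, original line)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        key = (fields[1], fields[2])
        value = float(fields[3])  # assumes column 4 is numeric
        # Strict ">" means that on a tie the first row seen is kept.
        if key not in best or value > best[key][0]:
            best[key] = (value, line)
    # Dicts preserve insertion order, so rows come out in first-seen order.
    return [line for _, line in best.values()]


sample = """chr4 124097568 124098568 337
chr4 159597106 159598106 1000
chr4 159597106 159598106 1000
chr4 164361532 164362532 455
chr4 164361532 164362532 74
chr4 164361532 164362532 2
chr4 170360150 170361150 0""".splitlines()

for row in best_rows(sample):
    print(row)
```

On the sample data from the question, this prints the four rows listed as the preferred output. It does not need the input to be sorted.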
I am unaware of how to do this in bash.
Best regards, Sven
Upvotes: 0
Reputation: 42870
$ cat example.txt
chr4 124097568 124098568 337
chr4 159597106 159598106 1000
chr4 159597106 159598106 1000
chr4 164361532 164362532 455
chr4 164361532 164362532 74
chr4 164361532 164362532 2
chr4 170360150 170361150 0
$ sort --key=2 -g -u example.txt
chr4 124097568 124098568 337
chr4 159597106 159598106 1000
chr4 164361532 164362532 455
chr4 170360150 170361150 0
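As far as I can tell, `-u` keeps the first line of each run that compares equal on the key, so the command above gives the wanted result here because the highest value already comes first within each group. If the input order cannot be relied on, the same "sort, then keep the first of each group" idea can be sketched in Python with an explicit descending sort on column 4. The function name `first_of_each_group` and the shuffled sample are assumptions for illustration only:

```python
from itertools import groupby


def first_of_each_group(lines):
    """Sort rows so the largest column 4 leads each (col 2, col 3) group, then keep it."""
    rows = [line.split() for line in lines if line.strip()]

    def position(row):
        return (int(row[1]), int(row[2]))

    # Ascending by columns 2 and 3, descending by column 4.
    rows.sort(key=lambda r: (position(r), -float(r[3])))
    # groupby only merges adjacent rows, which is why the sort has to come first.
    # Note: joining with single spaces normalizes the original whitespace.
    return [" ".join(next(group)) for _, group in groupby(rows, key=position)]


shuffled = [
    "chr4 164361532 164362532 2",
    "chr4 170360150 170361150 0",
    "chr4 164361532 164362532 455",
    "chr4 159597106 159598106 1000",
    "chr4 124097568 124098568 337",
    "chr4 164361532 164362532 74",
    "chr4 159597106 159598106 1000",
]

for row in first_of_each_group(shuffled):
    print(row)
```

Unlike the one-line `sort` call, this version sorts the output by columns 2 and 3 regardless of how the input was ordered.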
Upvotes: 3
Reputation: 661
That would be easier if the last column (column 4) were right-aligned, that is, padded with spaces on the left, like this:
chr4 124097568 124098568  337
chr4 159597106 159598106 1000
chr4 159597106 159598106 1000
chr4 164361532 164362532  455
chr4 164361532 164362532   74
chr4 164361532 164362532    2
chr4 170360150 170361150    0
That way, a combination of sort and uniq could do the trick.
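To illustrate why the alignment matters, here is a small Python sketch, assuming non-negative integer values, that compares plain string sorting with sorting after right-aligning to a common width. The helper name `right_align` is made up for this example:

```python
def right_align(values):
    """Pad each value with leading spaces so all strings have the same width."""
    width = max(len(v) for v in values)
    return [v.rjust(width) for v in values]


values = ["337", "1000", "455", "74", "2", "0"]

# Plain string comparison goes character by character, so "1000" sorts before "2".
print(sorted(values))

# A space sorts before any digit, so after right-aligning,
# string order matches numeric order for non-negative integers.
print(sorted(right_align(values)))
```

The same effect is what lets a purely textual sort behave like a numeric one on the padded column.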
Upvotes: 1