pogibas

Reputation: 28329

Bash - extract info from table according to specific characteristics

My data table looks like this:

chr4    124097568       124098568       337
chr4    159597106       159598106       1000   
chr4    159597106       159598106       1000 
chr4    164361532       164362532       455
chr4    164361532       164362532       74
chr4    164361532       164362532       2
chr4    170360150       170361150       0

I want to extract unique rows: if several rows have the same values in col#2 and col#3, only the row with the highest value in col#4 should be kept. If col#2, col#3 and col#4 are all identical, just one of those rows should be kept.

Preferred output is:

chr4    124097568       124098568       337
chr4    159597106       159598106       1000 
chr4    164361532       164362532       455
chr4    170360150       170361150       0

If something is not clear, I'll try to explain it further (because I really need to solve this problem now).

Upvotes: 0

Views: 98

Answers (5)

potong

Reputation: 58381

This might work for you:

 sort -k4nr file | sort -uk2,3n

Upvotes: 1

glenn jackman

Reputation: 246774

awk '
    {key = $2 SUBSEP $3}
    !(key in max) || $4 > max[key] {max[key]=$4; line[key]=$0}
    END {for (key in line) print line[key]}
'
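
For readers who would rather see the same idea outside awk, here is a minimal Python sketch of the approach above: key each row on columns 2 and 3 and remember the row whose column 4 is largest. The function name `best_rows` and the inline sample are illustrative assumptions, not part of the original answer. Unlike awk's `for (key in line)`, whose iteration order is unspecified, this version keeps the keys in first-seen order.

```python
def best_rows(lines):
    """Keep, for each (col2, col3) pair, the row with the highest col4."""
    best = {}  # (col2, col3) -> (col4 as int, original line)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        key = (fields[1], fields[2])
        value = int(fields[3])
        # Strict '>' means the first row wins when values tie.
        if key not in best or value > best[key][0]:
            best[key] = (value, line.rstrip())
    # Dicts preserve insertion order (Python 3.7+).
    return [line for _, line in best.values()]


# Sample rows copied from the question.
sample = [
    "chr4    124097568       124098568       337",
    "chr4    159597106       159598106       1000",
    "chr4    159597106       159598106       1000",
    "chr4    164361532       164362532       455",
    "chr4    164361532       164362532       74",
    "chr4    164361532       164362532       2",
    "chr4    170360150       170361150       0",
]

for row in best_rows(sample):
    print(row)
```

On the sample from the question this prints the four rows listed as the preferred output.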

Upvotes: 1

Sven Hager

Reputation: 3194

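One possible solution is to sort the lines of your input and then drop the lines that occur more than once. Note that this removes exact duplicates only; it does not pick the row with the highest value in col#4. In Python, you could do something like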

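with open("table.dat", "r") as f:
    # One entry per row; strip trailing whitespace so that rows
    # differing only in trailing spaces compare equal.
    lines = [line.rstrip() for line in f if line.strip()]
lines.sort()

# After sorting, identical rows are adjacent: keep the first of each run.
singles = lines[:1]
for line in lines[1:]:
    if line != singles[-1]:
        singles.append(line)

print("\n".join(singles))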

I am unaware of how to do this in bash.

Best regards, Sven

Upvotes: 0

Anders Lindahl

Reputation: 42870

$ cat example.txt
chr4    124097568       124098568       337
chr4    159597106       159598106       1000   
chr4    159597106       159598106       1000 
chr4    164361532       164362532       455
chr4    164361532       164362532       74
chr4    164361532       164362532       2
chr4    170360150       170361150       0

$ sort --key=2 -g -u example.txt 
chr4    124097568       124098568       337
chr4    159597106       159598106       1000   
chr4    164361532       164362532       455
chr4    170360150       170361150       0

Upvotes: 3

Robson França

Reputation: 661

That would be easier if the last column (COL#4) were right-aligned (padded with spaces on the left), like this:

chr4    124097568       124098568        337
chr4    159597106       159598106       1000   
chr4    159597106       159598106       1000 
chr4    164361532       164362532        455
chr4    164361532       164362532         74
chr4    164361532       164362532          2
chr4    170360150       170361150          0

That way, a combination of sort and uniq could do the trick.
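
To make the padding idea concrete, here is a rough sketch in Python rather than with sort and uniq themselves. It assumes whitespace-separated columns and non-negative integers in column 4 that fit in the chosen width; the helper names and the `width=10` default are hypothetical. Right-aligning the last column makes plain string ordering agree with numeric ordering, so after an ordinary sort the last row of each (col1, col2, col3) group is the one with the highest value.

```python
from itertools import groupby


def pad_last_column(line, width=10):
    """Right-align column 4 so that string order matches numeric order."""
    chrom, start, end, value = line.split()
    return f"{chrom}\t{start}\t{end}\t{value:>{width}}"


def highest_per_region(lines, width=10):
    padded = sorted(pad_last_column(l, width) for l in lines if l.strip())
    result = []
    # Rows sharing the first three columns share a string prefix, so they
    # are adjacent after sorting; within a group the padded values ascend,
    # which means the last row of each group holds the highest value.
    for _, group in groupby(padded, key=lambda row: row.split("\t")[:3]):
        result.append(list(group)[-1])
    return result


# Sample rows copied from the question.
sample = [
    "chr4    124097568       124098568       337",
    "chr4    159597106       159598106       1000",
    "chr4    159597106       159598106       1000",
    "chr4    164361532       164362532       455",
    "chr4    164361532       164362532       74",
    "chr4    164361532       164362532       2",
    "chr4    170360150       170361150       0",
]

for row in highest_per_region(sample):
    print(row)
```

The output rows are tab-separated with the last column padded, and they come out in string order of the first three columns, which need not be numeric order if the coordinates have different numbers of digits.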

Upvotes: 1
