Reputation: 475
I have a problem that I am trying to solve with awk. It has an application in selecting good-quality single nucleotide polymorphisms (SNPs) for placing on a SNP chip, where there is a requirement that no SNP is within 60 bp of another SNP. The file looks like this:
comp1008_seq1 20
comp1008_seq1 234
comp1008_seq1 260
comp1008_seq1 500
comp3044_seq1 300
comp3044_seq1 350
comp3044_seq1 460
comp3044_seq1 600
................
I want to print only records that are not within ±60 of the previous record (based on field 2) when they are from the same component (based on field 1). It doesn't matter if records from different components are within ±60 of each other. The output for the above example should look like this:
comp1008_seq1 20
comp1008_seq1 234
comp1008_seq1 500
comp3044_seq1 300
comp3044_seq1 460
comp3044_seq1 600
Upvotes: 0
Views: 65
Reputation: 3727
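# Print a record when it starts a new component or lies more than 60 bp
# from the previous record of the same component.
{
    if ($1 != last1 || abs($2 - last2) > 60) print
    # Remember this record (printed or not) for the next comparison.
    last1 = $1; last2 = $2
}
# awk has no built-in abs(), so define one.
function abs(x) {
    return x > 0 ? x : -x
}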
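A minimal usage sketch, assuming the script above is saved as `filter_snps.awk` and the sample lines from the question as `snps.txt` (both file names are hypothetical). Each line is compared only with the line immediately before it, so the script relies on the input being grouped by component and sorted by position. The `sort` call is one way to guarantee that and is an addition here, not part of the answer.

```shell
# Hypothetical file names; the data are the sample lines from the question.
cat > snps.txt <<'EOF'
comp1008_seq1 20
comp1008_seq1 234
comp1008_seq1 260
comp1008_seq1 500
comp3044_seq1 300
comp3044_seq1 350
comp3044_seq1 460
comp3044_seq1 600
EOF

# The awk program from the answer, saved to a file.
cat > filter_snps.awk <<'EOF'
{
    if ($1 != last1 || abs($2 - last2) > 60) print
    last1 = $1; last2 = $2
}
function abs(x) {
    return x > 0 ? x : -x
}
EOF

# Sort by component, then numerically by position, before filtering.
sort -k1,1 -k2,2n snps.txt | awk -f filter_snps.awk
```

On the sample data this drops the records at 260 and 350, which matches the output requested in the question. Note that `last2` is updated for every line, including skipped ones, so in a run such as 20, 70, 120 both 70 and 120 are dropped even though 120 is 100 bp from 20. If comparing against the last printed record is preferred, moving the `last2 = $2` assignment into the printing branch would be one way to do it. That is a variation, not part of the original answer.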
Upvotes: 3