Reputation: 3372

Specific sorting to category by awk and bash

Dear all i have one question.

I have the input like : (second column is only index)

chr1 1 30

chr1 2 40.5

chr1 3 30.5

chr1 4 41

chr2 10 60

chr2 15 40.1

And i want to get this:

           chr1  chr2

30 - 31     2     0

31 - 32     0     0

...

40 - 41     1     1 etc..

I need categorize data to each group from 30 to 60 per 1. From the input data I count all rows for chr1 which are contain in in the category 30-31 from $3. I have this code, but I do not understand where is problem: (some problem with loop)

samtools view /home/filip/Desktop/AMrtin\ Hynek/54321Odfiltrovany.bam | awk '{ n=length($10); print $3,"\t",NR,"\t", gsub(/[GCCgcs]/,"",$10)/n;}'  | awk '($3 <= 0.6 && $3 >= 0.3)' | awk '{print $1,"\t",$2,"\t",($3*100)}' > data.txt

for j in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
do

export $j

awk -v sop=$j  '{if($1 == $sop) print $0}' data.txt |
awk '{d=int($3)
      a[d]++
      if (NR==1) {min=d}
      min=(min>=d?d:min)
      max=(max>d?max:d)}
      END{for (i=min; i<=max; i++) print i, "-", i+1, a[i]+0}' ;
done

Part of code I made by help "fedorqui"

Upvotes: 0

Answers (4)

Sean Summers

Reputation: 2684

If you are using gawk, this should work. There's a filter for $1 that should handle everything you were doing with $j (unless you truly need only chr1..chr22, in which case it should still be possible to develop a regex for it).

BEGIN {
  for(i = 30; i <= 60; i++) {
    rstring = i " - " i + 1;
    rows[rstring] = 0;
  }
}
$1 ~ /^chr[0-9][0-9]?$/ {
  row = int($3) " - " int($3) + 1;
  columns[$1] = 0;
  rows[row] = 0;
  data[row][$1] += 1;
  rowwidth = length(row) > rowwidth ? length(row) : rowwidth;
  colwidth = length($1) > colwidth ? length($1) : colwidth;
 }
END {
  rowheader = "%-" (rowwidth * 2) "s";
  colheader = "%" colwidth "s\t";
  dataformat = "%" int(colwidth / 2) "d\t";

  asorti(columns, sortedcolumns);
  asorti(rows, sortedrows);

  printf rowheader, "";
  for(c in sortedcolumns) printf "%s\t", sortedcolumns[c];
  print "";

  for(r in sortedrows) {
    printf rowheader, sortedrows[r];
    for(c in sortedcolumns)
      printf dataformat, data[sortedrows[r]][sortedcolumns[c]];
    print ""
  }
}

Running it with gawk -f [scriptfile from above] < data.txt should produce something like:

chr1 chr2 30 - 31 2 0 31 - 32 0 0 . . . 39 - 40 0 0 40 - 41 1 1 41 - 42 1 0 42 - 43 0 0 . . . 59 - 60 0 0 60 - 61 0 1

Upvotes: 3

anujpaliwal

Reputation: 25

Following can be used if you want to use Perl

perl -ane '
$h{$F[0]}{int $F[2]}++;
push @range, int $F[2]; 
}{
@range = sort @range; 
print "\t\t", join "\t", sort { $a cmp $b } keys %h; print "\n";
for $i ($range[0] .. $range[-1]) {
print "$i - ", $i + 1, "\t\t"; 
print $h{$_}{$i} + 0, "\t" for sort { $a cmp $b } keys %h; print "\n"
}' file

Output should be like this

                chr1    chr2
30 - 31         2       0
31 - 32         0       0
32 - 33         0       0
33 - 34         0       0
34 - 35         0       0
35 - 36         0       0
36 - 37         0       0
37 - 38         0       0
38 - 39         0       0
39 - 40         0       0
40 - 41         1       1
41 - 42         1       0
42 - 43         0       0
43 - 44         0       0
44 - 45         0       0
45 - 46         0       0
46 - 47         0       0
47 - 48         0       0
48 - 49         0       0
49 - 50         0       0
50 - 51         0       0
51 - 52         0       0
52 - 53         0       0
53 - 54         0       0
54 - 55         0       0
55 - 56         0       0
56 - 57         0       0
57 - 58         0       0
58 - 59         0       0
59 - 60         0       0

Upvotes: 2

jaypal singh

Reputation: 77145

Using awk:

awk '
!($1 in chrs) { chr[++c] = $1 ; chrs[$1]++ }
{
    val = int($3); 
    map[$1,val]++;
    min = (NR==1?val:min>=val?val:min);
    max = (max>val?max:val)
}
END {
    printf "\t\t"
    for (j=1; j<=c; j++) {
        printf "%s%s", sep, chr[j]
        sep = "\t"
    }
    print ""
    for (i=min; i<=max; i++) {
        printf "%d - %d\t", i, i+1
        for (j=1; j<=c; j++) {
            printf "\t%s", map[chr[j],i] + 0
        }
        print ""
    }
}' file
                chr1    chr2
30 - 31         2       0
31 - 32         0       0
32 - 33         0       0
...
38 - 39         0       0
39 - 40         0       0
40 - 41         1       1
41 - 42         1       0
42 - 43         0       0
...
59 - 60         0       0
60 - 61         0       1

You increment the chr array by the order of chromosome seen.
Rest stuff in the main block is pretty much your code except that we also create a map array that is indexed at chromosome and range having counts as its value.
In the END block we first iterate over our chr array and print the chromosomes
Then using our min and max variables we create a loop and print the values from our map array which is indexed at chromosome and the range.
I have truncated some lines from the output. As you can see from the output it will print all numbers starting at min and ending at max.

Upvotes: 3

Idriss Neumann

Reputation: 3838

First, you could use :

for j in {1..22}; do
    chrj="char$j"
    # now you could use $chrj instead of $j in this loop
done

Instead of :

for j in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
do
    # ...
done

Then, you don't need to multiply calls to awk and pipes. Only one awk should be enough.

For example :

... | awk '($3 <= 0.6 && $3 >= 0.3)' | awk '{print $1,"\t",$2,"\t",($3*100)}'

Should be :

awk '($3 <= 0.6 && $3 >= 0.3){print $1,"\t",$2,"\t",($3*100)}'
# or
awk '{if ($3 <= 0.6 && $3 >= 0.3){print $1,"\t",$2,"\t",($3*100)}}'

Otherwise :

export $j

What is the purpose of this export ?

I haven't read everything on your code but at this point many optimizations must be done !

Upvotes: 3

Specific sorting to category by awk and bash

Answers (4)

Related Questions