Reputation: 3372
Dear all i have one question.
I have the input like : (second column is only index)
chr1 1 30
chr1 2 40.5
chr1 3 30.5
chr1 4 41
chr2 10 60
chr2 15 40.1
And i want to get this:
chr1 chr2
30 - 31 2 0
31 - 32 0 0
...
40 - 41 1 1 etc..
I need categorize data to each group from 30 to 60 per 1. From the input data I count all rows for chr1 which are contain in in the category 30-31 from $3. I have this code, but I do not understand where is problem: (some problem with loop)
samtools view /home/filip/Desktop/AMrtin\ Hynek/54321Odfiltrovany.bam | awk '{ n=length($10); print $3,"\t",NR,"\t", gsub(/[GCCgcs]/,"",$10)/n;}' | awk '($3 <= 0.6 && $3 >= 0.3)' | awk '{print $1,"\t",$2,"\t",($3*100)}' > data.txt
for j in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
do
export $j
awk -v sop=$j '{if($1 == $sop) print $0}' data.txt |
awk '{d=int($3)
a[d]++
if (NR==1) {min=d}
min=(min>=d?d:min)
max=(max>d?max:d)}
END{for (i=min; i<=max; i++) print i, "-", i+1, a[i]+0}' ;
done
Part of code I made by help "fedorqui"
Upvotes: 0
Views: 242
Reputation: 2684
If you are using gawk, this should work. There's a filter for $1 that should handle everything you were doing with $j (unless you truly need only chr1..chr22, in which case it should still be possible to develop a regex for it).
BEGIN {
for(i = 30; i <= 60; i++) {
rstring = i " - " i + 1;
rows[rstring] = 0;
}
}
$1 ~ /^chr[0-9][0-9]?$/ {
row = int($3) " - " int($3) + 1;
columns[$1] = 0;
rows[row] = 0;
data[row][$1] += 1;
rowwidth = length(row) > rowwidth ? length(row) : rowwidth;
colwidth = length($1) > colwidth ? length($1) : colwidth;
}
END {
rowheader = "%-" (rowwidth * 2) "s";
colheader = "%" colwidth "s\t";
dataformat = "%" int(colwidth / 2) "d\t";
asorti(columns, sortedcolumns);
asorti(rows, sortedrows);
printf rowheader, "";
for(c in sortedcolumns) printf "%s\t", sortedcolumns[c];
print "";
for(r in sortedrows) {
printf rowheader, sortedrows[r];
for(c in sortedcolumns)
printf dataformat, data[sortedrows[r]][sortedcolumns[c]];
print ""
}
}
Running it with gawk -f [scriptfile from above] < data.txt
should produce something like:
chr1 chr2
30 - 31 2 0
31 - 32 0 0
. . .
39 - 40 0 0
40 - 41 1 1
41 - 42 1 0
42 - 43 0 0
. . .
59 - 60 0 0
60 - 61 0 1
Upvotes: 3
Reputation: 25
Following can be used if you want to use Perl
perl -ane '
$h{$F[0]}{int $F[2]}++;
push @range, int $F[2];
}{
@range = sort @range;
print "\t\t", join "\t", sort { $a cmp $b } keys %h; print "\n";
for $i ($range[0] .. $range[-1]) {
print "$i - ", $i + 1, "\t\t";
print $h{$_}{$i} + 0, "\t" for sort { $a cmp $b } keys %h; print "\n"
}' file
Output should be like this
chr1 chr2
30 - 31 2 0
31 - 32 0 0
32 - 33 0 0
33 - 34 0 0
34 - 35 0 0
35 - 36 0 0
36 - 37 0 0
37 - 38 0 0
38 - 39 0 0
39 - 40 0 0
40 - 41 1 1
41 - 42 1 0
42 - 43 0 0
43 - 44 0 0
44 - 45 0 0
45 - 46 0 0
46 - 47 0 0
47 - 48 0 0
48 - 49 0 0
49 - 50 0 0
50 - 51 0 0
51 - 52 0 0
52 - 53 0 0
53 - 54 0 0
54 - 55 0 0
55 - 56 0 0
56 - 57 0 0
57 - 58 0 0
58 - 59 0 0
59 - 60 0 0
Upvotes: 2
Reputation: 77145
Using awk
:
awk '
!($1 in chrs) { chr[++c] = $1 ; chrs[$1]++ }
{
val = int($3);
map[$1,val]++;
min = (NR==1?val:min>=val?val:min);
max = (max>val?max:val)
}
END {
printf "\t\t"
for (j=1; j<=c; j++) {
printf "%s%s", sep, chr[j]
sep = "\t"
}
print ""
for (i=min; i<=max; i++) {
printf "%d - %d\t", i, i+1
for (j=1; j<=c; j++) {
printf "\t%s", map[chr[j],i] + 0
}
print ""
}
}' file
chr1 chr2
30 - 31 2 0
31 - 32 0 0
32 - 33 0 0
...
38 - 39 0 0
39 - 40 0 0
40 - 41 1 1
41 - 42 1 0
42 - 43 0 0
...
59 - 60 0 0
60 - 61 0 1
chr
array by the order of chromosome seen. map
array that is indexed at chromosome and range having counts as its value. END
block we first iterate over our chr
array and print the chromosomesmin
and max
variables we create a loop and print the values from our map
array which is indexed at chromosome and the range. min
and ending at max
. Upvotes: 3
Reputation: 3838
First, you could use :
for j in {1..22}; do
chrj="char$j"
# now you could use $chrj instead of $j in this loop
done
Instead of :
for j in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 do # ... done
Then, you don't need to multiply calls to awk
and pipes. Only one awk
should be enough.
For example :
... | awk '($3 <= 0.6 && $3 >= 0.3)' | awk '{print $1,"\t",$2,"\t",($3*100)}'
Should be :
awk '($3 <= 0.6 && $3 >= 0.3){print $1,"\t",$2,"\t",($3*100)}'
# or
awk '{if ($3 <= 0.6 && $3 >= 0.3){print $1,"\t",$2,"\t",($3*100)}}'
Otherwise :
export $j
What is the purpose of this export
?
I haven't read everything on your code but at this point many optimizations must be done !
Upvotes: 3