Reputation: 47
I'm trying to work out whether the count in each row is greater than a certain value: 30% of the total counts.
Within a for loop I've obtained the percentage with awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value,
and that's a single number; the output file contains only that.
How do I make the "value is greater than" comparison for each row of ${i}_counts against ${i}_percentage-value?
In other words, how do I use the number inside a file as a numerical value in a math operation?
Data:
data.csv (an extract)
SampleID ASV Count
1000A ASV_1216 14
1000A ASV_12580 150
1000A ASV_12691 260
1000A ASV_135 434
1000A ASV_147 79
1000A ASV_15 287
1000A ASV_16 361
1000A ASV_184 8
1000A ASV_19 42
samples-ID-short
1000A
1000B
1000C
So for each sample ID there are a lot of ASVs, and the quantity varies a lot: say 50 ASVs for 1000A, 120 for 1000B, and so on. Every ASV_## has a count. My code calculates the total sum of the counts, then finds the 30% value for each sample and reports which ASV_## is greater than 30%. Ultimately, it should report a 0 for <30% and a 1 for >30%.
Here's my code so far:
for i in $(cat samples-ID-short)
do
grep ${i} data.csv | cut -d , -f3 - > ${i}_count_sample
grep ${i} data.csv | cut -d , -f2 - > ${i}_ASV
awk '{ sum += $1; } END { print sum; }' ${i}_count_sample > ${i}_counts
awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value
#I was thinking about replicating the numeric value down an entire column and then making the "greater than" comparison, but the number of repetitions depends on the ASV count of each sample, and that is always different.
wc -l ${i}_ASV > n
for (( c=1; c<=n; c++)) ; do echo ${i}_percentage-value ; done
paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_percentage-value > ${i}_tmp;
awk 'BEGIN{OFS="\t"}{if($2 >= $3) print $1}' ${i}_tmp > ${i}_is30;
#How the output should be:
paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_is30 > ${i}_summary_nh
echo -e "ASV_ID\tASV_in_sample\ttotal_ASVs_inSample\ttreshold_for_30%\tASV_over30%" | cat - ${i}_summary_nh > ${i}_summary
rm ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_ASV ${i}_summary_nh ${i}_is30
done &
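A minimal sketch of reading the single number out of a file and using it as a numeric value in awk: pull it into a shell variable with command substitution and pass it in with -v (the file names and values here are hypothetical stand-ins for the ${i}_* files above):

```shell
# Hypothetical stand-ins for ${i}_percentage-value and ${i}_count_sample
echo "300" > percentage-value
printf '14\n434\n361\n' > count_sample

# Read the single number out of the file, then hand it to awk as a variable
threshold=$(cat percentage-value)
awk -v thr="$threshold" '{ print (($1 > thr) ? 1 : 0) }' count_sample
# prints: 0, 1, 1 (one per line)
```

With -v the file's content becomes an ordinary awk variable, so it can be used in any arithmetic or comparison.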
Upvotes: 0
Views: 664
Reputation: 22012
Would you please try the following:
awk -v OFS="\t" '
NR==FNR {                        # this block is executed in the 1st pass only
    if (FNR > 1) sum[$1] += $3   # accumulate the "Count" for each "SampleID"
    next
}
# the following block is executed in the 2nd pass only
FNR > 1 {                        # skip the header line
    if ($1 != prev_id) {
        # SampleID has changed: update the output filename and print the header line
        if (outfile) close(outfile)    # close the previous outfile
        outfile = $1 "_summary"
        print "ASV_ID", "ASV_in_sample", "total_ASVs_inSample", "treshold_for_30%", "ASV_over30%" >> outfile
        prev_id = $1
    }
    mark = ($3 > sum[$1] * 0.3) ? 1 : 0      # set mark to "1" if the "Count" exceeds 30% of the sum
    print $2, $3, sum[$1], sum[$1] * 0.3, mark >> outfile   # append the line to the summary file
}
' data.csv data.csv
data.csv:
SampleID ASV Count
1000A ASV_1216 14
1000A ASV_12580 150
1000A ASV_12691 260
1000A ASV_135 434
1000A ASV_147 79
1000A ASV_15 287
1000A ASV_16 361
1000A ASV_184 8
1000A ASV_19 42
1000B ASV_1 90
1000B ASV_2 90
1000B ASV_3 20
1000C ASV_4 100
1000C ASV_5 10
1000C ASV_6 10
In the following output examples, the last field ASV_over30% is 1 if the count exceeds 30% of the sum value.
1000A_summary:
ASV_ID ASV_in_sample total_ASVs_inSample treshold_for_30% ASV_over30%
ASV_1216 14 1635 490.5 0
ASV_12580 150 1635 490.5 0
ASV_12691 260 1635 490.5 0
ASV_135 434 1635 490.5 0
ASV_147 79 1635 490.5 0
ASV_15 287 1635 490.5 0
ASV_16 361 1635 490.5 0
ASV_184 8 1635 490.5 0
ASV_19 42 1635 490.5 0
1000B_summary:
ASV_ID ASV_in_sample total_ASVs_inSample treshold_for_30% ASV_over30%
ASV_1 90 200 60 1
ASV_2 90 200 60 1
ASV_3 20 200 60 0
1000C_summary:
ASV_ID ASV_in_sample total_ASVs_inSample treshold_for_30% ASV_over30%
ASV_4 100 120 36 1
ASV_5 10 120 36 0
ASV_6 10 120 36 0
[Explanations]
When calculating a sum (or an average) over the input data, we have to read through to the end of the data first. If we also want to print each input record alongside that sum (or information derived from it), we need one of two tricks:

1. store all records in memory and print them in the END block, or
2. read the input file twice.

As awk is well suited to reading multiple files and changing the procedure depending on the order of the files, I have picked the 2nd method. The condition NR==FNR returns TRUE only while reading the 1st file, so the sum of the Count field is accumulated in that block as the 1st pass; the next statement at the end of the block skips the following code. During the 2nd pass, NR==FNR no longer returns TRUE, the 1st block is skipped, and the summary lines are printed.

Upvotes: 0
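For comparison, a minimal sketch of the alternative trick (buffering the records in memory and printing them in the END block), using a hypothetical three-line data.csv with the same layout:

```shell
# A tiny data.csv matching the layout above
printf 'SampleID ASV Count\n1000B ASV_1 90\n1000B ASV_2 90\n1000B ASV_3 20\n' > data.csv

# Buffer every record in arrays, then print in END once the per-sample
# sums are complete (sum for 1000B is 200, so the 30% cutoff is 60)
awk 'NR > 1 { line[NR] = $0; id[NR] = $1; cnt[NR] = $3; sum[$1] += $3 }
END { for (i = 2; i <= NR; i++) print line[i], ((cnt[i] > sum[id[i]] * 0.3) ? 1 : 0) }' data.csv
```

This avoids the second read at the cost of holding the whole file in memory, which is why the two-pass version scales better for large inputs.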
Reputation: 1498
You can filter on a column based on a value, e.g.:
$ awk '$3>300' data.csv
SampleID ASV Count
1000A ASV_135 434
1000A ASV_16 361
You can use >= for greater than or equal to.
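For instance, a >= test keeps a row whose count exactly equals the cutoff (here a hypothetical cutoff of 361, with a few rows of the extract recreated as data.csv):

```shell
# A few rows from the question's extract
printf 'SampleID ASV Count\n1000A ASV_135 434\n1000A ASV_16 361\n1000A ASV_19 42\n' > data.csv

# ">" would drop the 361 row; ">=" keeps it
awk '$3 >= 361' data.csv
# SampleID ASV Count
# 1000A ASV_135 434
# 1000A ASV_16 361
```

Note the header line survives the filter because awk compares the non-numeric string Count against "361" as strings, which is also why it appears in the output above.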
It looks like your script is overcomplicating matters.
Upvotes: 2
Reputation: 67487
This should work; note that awk converts the string 1000A to the number 1000, so $1*3/10 evaluates to 300 here:
$ awk 'NR==1 || $3>$1*3/10' file
SampleID ASV Count
1000A ASV_135 434
1000A ASV_16 361
Or, with an indicator column:
$ awk 'NR==1{print $0, "Ind"} NR>1{print $0, ($3>$1*3/10)}' file | column -t
SampleID ASV Count Ind
1000A ASV_1216 14 0
1000A ASV_12580 150 0
1000A ASV_12691 260 0
1000A ASV_135 434 1
1000A ASV_147 79 0
1000A ASV_15 287 0
1000A ASV_16 361 1
1000A ASV_184 8 0
1000A ASV_19 42 0
Upvotes: 1