Reputation: 23
I would like to know if there is a way to have as output a single file where each column has some computation numbers taken from multiple files. My input is:
@SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52
AGTAAAGGGACTCGGTCTCCTTCCATTGGAGGTTGTTTTCTAGGCTCAACAC
+SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52
?;=ADDDDF@C3ACE:E?FED+CF>AABGFFB:?10?:BDDFB?@3BFFEEF
@SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
TTGATAGGGGAGATGCTAGCAAAAAGGTGTACTTCTCAGCGGAGCAGAAAGA
+SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
CCCFFFFFHHHHHIHIGHIIIGGIHII?DGHIIIIIIEHCHIIIIIIHIHHI
@SRR1544694.3 Run0199_AC237YACXX_L2_T1101_C54 length=52
TTTTTGGGGGGGAATTCTCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCA
The aim is to count the percentage of G and C elements in the lines in the ATGC lines (second row and every 4 rows). The real files will have millions of lines. The expected output should be:
File1 File2
48.0769 48.0769
46.1538 46.1538
42.3077 42.3077
32.6923 32.6923
51.9231 51.9231
42.3077 42.3077
I have tried the code below. It outputs the calculations done in specific lines, to a single file matching each original file. If the output is not defined, it will print a single column.
awk '
FNR==1{ # first record of an input file?
if(o)close(o); # was previous output file? close it
o=FILENAME;sub(/\.fastq/,"_sorted.txt",o) # new output file name
}
{
if(NR%4==2){n=length($1); gc=gsub("[gcGC]", "", $1); print gc/n*100 >o}
}
' *.fastq
I would like to know if there is a way, using awk (especially to learn the tool) to have all the calculations in a single file, column separated.
Upvotes: 1
Views: 206
Reputation: 141493
With the following I recreated the input files:
cat >1.fastq <<EOF
@SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52
AGTAAAGGGACTCGGTCTCCTTCCATTGGAGGTTGTTTTCTAGGCTCAACAC
+SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52
?;=ADDDDF@C3ACE:E?FED+CF>AABGFFB:?10?:BDDFB?@3BFFEEF
@SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
TTGATAGGGGAGATGCTAGCAAAAAGGTGTACTTCTCAGCGGAGCAGAAAGA
+SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
CCCFFFFFHHHHHIHIGHIIIGGIHII?DGHIIIIIIEHCHIIIIIIHIHHI
@SRR1544694.3 Run0199_AC237YACXX_L2_T1101_C54 length=52
TTTTTGGGGGGGAATTCTCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCA
EOF
cat >2.fastq <<EOF
@SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52
AGTAAAGGGACTCGGTCTCCTTCCATTGGAGGTTGTTTTCTAGGCTCAACAC
+SRR1544694.1 Run0199_AC237YACXX_L2_T1101_C27 length=52
?;=ADDDDF@C3ACE:E?FED+CF>AABGFFB:?10?:BDDFB?@3BFFEEF
@SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
TTGATAGGGGAGATGCTAGCAAAAAGGTGTACTTCTCAGCGGAGCAGAAAGA
+SRR1544694.2 Run0199_AC237YACXX_L2_T1101_C28 length=52
CCCFFFFFHHHHHIHIGHIIIGGIHII?DGHIIIIIIEHCHIIIIIIHIHHI
@SRR1544694.3 Run0199_AC237YACXX_L2_T1101_C54 length=52
TTTTTGGGGGGGAATTCTCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCA
EOF
The following script with comments:
# print the headers
printf "%s\n" *.fastq | paste -sd' '
# merge all files
paste *.fastq |
# for each line for each the field print the counts
awk 'NR % 4 == 2{
for (i = 1; i <= NF; ++i) {
n = length($i); gc = gsub("[gcGC]", "", $i); res = gc/n*100
printf res (i == NF ? "\n" : "\t")
}
}' "$i"
will generate the following output (both files are the same, it's just to illustrate it works):
1.fastq 2.fastq
48.0769 48.0769
46.1538 46.1538
42.3077 42.3077
I didn't modify the logic behind calculating the percentages, only modified how the results are merged.
I leave my other tries to do it below:
Alternatively with the following script, that folds the result using paste
not inside awk
:
output=""
for i in *.fastq; do
output=$(
printf "%s\n" "$output" |
# merge line-wise output with another results
paste - <(
echo "$i"
awk '{ if(NR%4==2){n=length($1); gc=gsub("[gcGC]", "", $1); print gc/n*100} }' "$i"
)
)
done
# in output theere will be leading tabs
# remove them
output=$(printf "%s" "$output" | cut -f2-)
echo "$output"
If there are really million of lines and files, using temporary file for paste
output instead of storing it in a variable will save ton of memory.
Alternatively it to generate all results line-wise and then for example transpose the output using awk script from stackoverflow:
for i in *.fastq; do
printf "%s" "$i"
awk '{ if(NR%4==2){n=length($1); gc=gsub("[gcGC]", "", $1); printf " " gc/n*100} }' "$i"
echo
done |
awk '
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str" "a[i,j];
}
print str
}
}'
Upvotes: 0