Reputation: 5759
My goal is to extract pieces of data from genome sequencing Fastq files and plot them. I would like to get the identifying information for each sequencing read and then two pieces of information about the read.
Below I have pasted two reads from a Fastq file for reference if it helps.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 12_S12_L001
chr1 115227813 . C G 2120.73 . AB=0.725;ABP=73.366;AC=1;AF=0.5;AN=2;AO=116;CIGAR=1X;DP=160;DPB=160;DPRA=0;EPP=254.901;EPPR=87.6977;GTI=0;LEN=1;MEANALT=3;MQM=60;MQMR=60;NS=
1;NUMALT=1;ODDS=152.168;PAIRED=0.991379;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=3761;QR=1366;RO=39;RPP=254.901;RPPR=87.6977;RUN=1;SAF=116;SAP=254.901;SAR=0;SRF=39;SRP=87.6977;SRR=0;TYPE=snp GT:DP:RO:QR:AO:Q
A:GL 0/1:160:39:1366:116:3761:-10,0,-10
chr1 115227814 . G A,C,T 8.27007e-12 . AB=0,0,0;ABP=0,0,0;AC=0,0,0;AF=0,0,0;AN=2;AO=120,11,35;CIGAR=1X,1X,1X;DP=84826;DPB=84826;DPRA=0,0,0;EPP=263.587,26.8965,79.0118;EPPR
=183840;GTI=0;LEN=1,1,1;MEANALT=3,3,3;MQM=60,60,60;MQMR=59.9996;NS=1;NUMALT=3;ODDS=115105;PAIRED=1,1,1;PAIREDR=0.990917;PAO=0,0,0;PQA=0,0,0;PQR=0;PRO=0;QA=4206,292,1061;QR=2822527;RO=84660;RPP=263.587,26.
8965,79.0118;RPPR=183840;RUN=1,1,1;SAF=120,11,35;SAP=263.587,26.8965,79.0118;SAR=0,0,0;SRF=84660;SRP=183840;SRR=0;TYPE=snp,snp,snp GT:DP:RO:QR:AO:QA:GL 0/0:84826:84660:2822527:120,11,35:4206,292,1
061:0,-10,-10,-10,-10,-10,-10,-10,-10,-10
Above, you can see that each read begins with the chromosome number from which the read was made, and the position of the read on that chromosome in column 1 and 2. In column 4 there is the reference base pair and column 5 contains the variant read. Then in column 8 there is a bunch of other information about the read, where each piece is separated by a semicolon.
The two number I care about here are whatever follows: RO=
and AO=
.
I would like to create an output file that contains just the information from Columns 1,2,4,5 and then put in the final column the fraction of AO/RO.
As an example of output starting with the first line, I would like the output of the following:
chr1 115227813 C G 0.74838
chr1 115227814 G A,C,T 0.00142
Where 0.74838 is calculated from RO=39 and AO=116 so 116/(39+116)=0.74838. And is calculated from RO=84660 and AO=120 so 120/(84660+120)=0.00142
Hopefully this clarifies the output I'm looking for.
Upvotes: 0
Views: 648
Reputation: 290225
This needed some research to find out how to do a kind of look-behind in awk
. It was interesting to discover it through a thread in google groups!
The idea is to use gensub()
to fetch the variable=value
in a given line and then print it back, removing the rest of the content of the line. So if we have hello hello;AO=23;bla bla bla
we then just get 23
.
awk 'v {
ro=gensub(/^.*;RO=([0-9]*).*$/, "\\1", "1");
printf "%s %f\n", f, (ao/(ao + ro)); v=0
}
/^chr/ {ao=gensub(/^.*;AO=([0-9]*).*$/,"\\1", "1");
v=1;
f=$1 FS $2 FS $4 FS $5
}' file
Basically, we look for lines starting with chr
. In these, we catch the 1st, 2nd, 4rd and 5th values. Then, we catch whatever is next to AO=
(only digits).
Since RO=
appears in the next line, we set a flag to search for it when reading the next line. Then, we get that value and print the full set of data. Finally we unset the flag so we start with the loop again.
$ awk 'v {ro=gensub(/^.*;RO=([0-9]*).*$/, "\\1", "1"); printf "%s %f\n", f, (ao/(ao + ro)); v=0} /^chr/ {ao=gensub(/^.*;AO=([0-9]*).*$/,"\\1", "1"); v=1; f=$1 FS $2 FS $4 FS $5}' a
chr1 115227813 C G 0.748387
chr1 115227814 G A,C,T 0.001415
Upvotes: 2