Extracting specific information from a Fastq file for Sequencing Analysis

Question

My goal is to extract pieces of data from genome sequencing Fastq files and plot them. I would like to get the identifying information for each sequencing read and then two pieces of information about the read.

Below I have pasted two reads from a Fastq file for reference if it helps.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  12_S12_L001
chr1    115227813       .       C       G       2120.73 .       AB=0.725;ABP=73.366;AC=1;AF=0.5;AN=2;AO=116;CIGAR=1X;DP=160;DPB=160;DPRA=0;EPP=254.901;EPPR=87.6977;GTI=0;LEN=1;MEANALT=3;MQM=60;MQMR=60;NS=
1;NUMALT=1;ODDS=152.168;PAIRED=0.991379;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=3761;QR=1366;RO=39;RPP=254.901;RPPR=87.6977;RUN=1;SAF=116;SAP=254.901;SAR=0;SRF=39;SRP=87.6977;SRR=0;TYPE=snp  GT:DP:RO:QR:AO:Q
A:GL    0/1:160:39:1366:116:3761:-10,0,-10
chr1    115227814       .       G       A,C,T   8.27007e-12     .       AB=0,0,0;ABP=0,0,0;AC=0,0,0;AF=0,0,0;AN=2;AO=120,11,35;CIGAR=1X,1X,1X;DP=84826;DPB=84826;DPRA=0,0,0;EPP=263.587,26.8965,79.0118;EPPR
=183840;GTI=0;LEN=1,1,1;MEANALT=3,3,3;MQM=60,60,60;MQMR=59.9996;NS=1;NUMALT=3;ODDS=115105;PAIRED=1,1,1;PAIREDR=0.990917;PAO=0,0,0;PQA=0,0,0;PQR=0;PRO=0;QA=4206,292,1061;QR=2822527;RO=84660;RPP=263.587,26.
8965,79.0118;RPPR=183840;RUN=1,1,1;SAF=120,11,35;SAP=263.587,26.8965,79.0118;SAR=0,0,0;SRF=84660;SRP=183840;SRR=0;TYPE=snp,snp,snp      GT:DP:RO:QR:AO:QA:GL    0/0:84826:84660:2822527:120,11,35:4206,292,1
061:0,-10,-10,-10,-10,-10,-10,-10,-10,-10

Above, you can see that each read begins with the chromosome number from which the read was made, and the position of the read on that chromosome in column 1 and 2. In column 4 there is the reference base pair and column 5 contains the variant read. Then in column 8 there is a bunch of other information about the read, where each piece is separated by a semicolon.

The two number I care about here are whatever follows: RO= and AO=.

I would like to create an output file that contains just the information from Columns 1,2,4,5 and then put in the final column the fraction of AO/RO.

As an example of output starting with the first line, I would like the output of the following:

chr1    115227813    C    G    0.74838
chr1    115227814    G    A,C,T    0.00142

Where 0.74838 is calculated from RO=39 and AO=116 so 116/(39+116)=0.74838. And is calculated from RO=84660 and AO=120 so 120/(84660+120)=0.00142

Hopefully this clarifies the output I'm looking for.

fedorqui · Accepted Answer

This needed some research to find out how to do a kind of look-behind in awk. It was interesting to discover it through a thread in google groups!

The idea is to use gensub() to fetch the variable=value in a given line and then print it back, removing the rest of the content of the line. So if we have hello hello;AO=23;bla bla bla we then just get 23.

awk 'v {
         ro=gensub(/^.*;RO=([0-9]*).*$/, "\1", "1"); 
         printf "%s %f
", f, (ao/(ao + ro)); v=0
     }
     /^chr/ {ao=gensub(/^.*;AO=([0-9]*).*$/,"\1", "1");
             v=1;
             f=$1 FS $2 FS $4 FS $5
            }' file

Basically, we look for lines starting with chr. In these, we catch the 1st, 2nd, 4rd and 5th values. Then, we catch whatever is next to AO= (only digits). Since RO= appears in the next line, we set a flag to search for it when reading the next line. Then, we get that value and print the full set of data. Finally we unset the flag so we start with the loop again.

Test

$ awk 'v {ro=gensub(/^.*;RO=([0-9]*).*$/, "\1", "1"); printf "%s %f
", f, (ao/(ao + ro)); v=0} /^chr/ {ao=gensub(/^.*;AO=([0-9]*).*$/,"\1", "1"); v=1; f=$1 FS $2 FS $4 FS $5}' a
chr1 115227813 C G 0.748387
chr1 115227814 G A,C,T 0.001415

Extracting specific information from a Fastq file for Sequencing Analysis

Answers (1)

Test

Related Questions