Sarah

Reputation: 67

bash: extract values into table

I have ~200 text files, each around 10 KB in size and all named fastqc_data.txt, each in a different subdirectory. The files were generated by a third party. The top of each file is shown below. My aim is to generate a new file whose first column contains the "Filename" value (in this example "1265-H19_AGGCAG_L007_R1_001.fastq"), whose second column contains the "Total Sequences" value ("41284554"), and whose third column contains the "Sequence length" value ("100").

Example input file 1:

FastQC 0.10.1  
Basic Statistics pass       
Measure        Value   
Filename        1265-H19_AGGCAG_L007_R1_001.fastq       
File type       Conventional base calls 
Encoding        Sanger / Illumina 1.9   
Total Sequences 41284554        
Filtered Sequences      0       
Sequence length 100     
%GC     41      
END_MODULE

Example output file:

Filename Total.Sequences Sequence.length  
1265-H19_AGGCAG_L007_R1_001.fastq 41284554 100  
1265-H20_TTTCAG_L007_R1_001.fastq 51387564 103  
1265-H21_CGGTTG_L007_R1_001.fastq 33254771 96

Upvotes: 1

Views: 137

Answers (1)

Tom Fenech

Reputation: 74595

You could transform your input into a row of output using an awk script like this:

awk 'BEGIN{print "Filename Total.Sequences Sequence.length"}
     /^Filename/{fn=$2}
     /^Total Sequences/{ts=$3}
     /^Sequence length/{print fn,ts,$3}' input_file

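The BEGIN block runs once, before any lines of your file are processed, and prints the header. awk splits each line on whitespace, so a two-word label such as "Total Sequences" occupies $1 and $2 and its value is $3. When the first two patterns match, the values are saved to the variables fn and ts, to be used later. When the final pattern matches, the saved values and that line's Sequence length are printed as one output row.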

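Of course, this makes a number of assumptions, such as that all the files contain the data in the same order (Filename, then Total Sequences, then Sequence length) and that the filename value contains no whitespace.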
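If the order of the lines cannot be relied on, one possible variation is to collect all three values and print the row only once a file has been read completely. This is a minimal sketch rather than a tested solution for your data: the function name emit_row, the variable len and the temporary test file are made up for illustration, and the sample deliberately lists the lines in reverse order.

```shell
#!/usr/bin/env bash
# Hypothetical sample file with the three lines in reverse order.
tmpfile=$(mktemp)
printf 'Sequence length 100\nTotal Sequences 41284554\nFilename 1265-H19_AGGCAG_L007_R1_001.fastq\n' > "$tmpfile"

rows=$(awk '
  # Print the values collected so far (if any), then reset them.
  function emit_row() {
    if (fn != "") print fn, ts, len
    fn = ts = len = ""
  }
  BEGIN { print "Filename Total.Sequences Sequence.length" }
  FNR == 1 { emit_row() }          # a new input file starts: flush the previous one
  /^Filename/        { fn  = $2 }
  /^Total Sequences/ { ts  = $3 }
  /^Sequence length/ { len = $3 }
  END { emit_row() }               # flush the last file
' "$tmpfile")

printf '%s\n' "$rows"
rm -f "$tmpfile"
```

Because the row is printed at the start of the next file or at the end of input, the position of the three lines within each file no longer matters. It still assumes each label appears at most once per file.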

Depending on the details of your directory structure and assuming that your shell supports it, you may be able to pass all of the files to the script like awk '...' **/fastqc_data.txt. This uses the "globstar" shell feature to recursively match all files with the name fastqc_data.txt and pass them all to the awk script. In bash (version 4 or later) the feature is off by default and has to be enabled first with shopt -s globstar.
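To make the globstar approach concrete, here is a rough sketch that builds a small throwaway directory tree and runs the same awk script over it. The directory names (run1/sampleA, run2) and the scratch directory are invented for the demonstration, and the sample files contain only the three relevant lines; in practice you would run just the awk command from the top of your own directory tree.

```shell
#!/usr/bin/env bash
shopt -s globstar   # bash 4+: lets ** match any number of directory levels

# Hypothetical scratch tree, for illustration only.
workdir=$(mktemp -d)
mkdir -p "$workdir/run1/sampleA" "$workdir/run2"

printf 'Filename 1265-H19_AGGCAG_L007_R1_001.fastq\nTotal Sequences 41284554\nSequence length 100\n' \
  > "$workdir/run1/sampleA/fastqc_data.txt"
printf 'Filename 1265-H20_TTTCAG_L007_R1_001.fastq\nTotal Sequences 51387564\nSequence length 103\n' \
  > "$workdir/run2/fastqc_data.txt"

# Same awk script as above; the glob expands to every fastqc_data.txt below $workdir.
table=$(cd "$workdir" && awk 'BEGIN{print "Filename Total.Sequences Sequence.length"}
     /^Filename/{fn=$2}
     /^Total Sequences/{ts=$3}
     /^Sequence length/{print fn,ts,$3}' **/fastqc_data.txt)

printf '%s\n' "$table"
rm -rf "$workdir"
```

The header is printed once because BEGIN runs a single time for the whole invocation, no matter how many files the glob expands to. To write the table to a file, redirect the awk output, for example with > summary.txt (a made-up name).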

Upvotes: 1
