Reputation: 67
I have ~200 text files, each around 10 KB in size and all named fastqc_data.txt, each in a different subdirectory. The files were generated by a third party. The top of each file is shown below. My aim is to generate a new file whose first column contains the "Filename" value (in this example "1265-H19_AGGCAG_L007_R1_001.fastq"), whose second column contains the "Total Sequences" value ("41284554"), and whose third column contains the "Sequence length" value ("100").
Example input file 1:
FastQC 0.10.1
Basic Statistics pass
Measure Value
Filename 1265-H19_AGGCAG_L007_R1_001.fastq
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 41284554
Filtered Sequences 0
Sequence length 100
%GC 41
END_MODULE
Example output file:
Filename Total.Sequences Sequence.length
1265-H19_AGGCAG_L007_R1_001.fastq 41284554 100
1265-H20_TTTCAG_L007_R1_001.fastq 51387564 103
1265-H21_CGGTTG_L007_R1_001.fastq 33254771 96
Upvotes: 1
Views: 137
Reputation: 74595
You could transform your input into a row of output using an awk script like this:
awk 'BEGIN{print "Filename Total.Sequences Sequence.length"}
/^Filename/{fn=$2}
/^Total Sequences/{ts=$3}
/^Sequence length/{print fn,ts,$3}' input_file
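The BEGIN block is executed before any lines of your file are processed, so the header is printed once. When the Filename and Total Sequences patterns match, the relevant fields are saved to the variables fn and ts, to be used later. When the final pattern (Sequence length) matches, the output row is printed from the two saved values and the third field of the current line.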
Of course, this makes a number of assumptions, such as that all the files contain the data in the same order.
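If the order of the lines cannot be relied on, one way to relax that assumption is to store all three values and print the row only when the next file begins, or at the end of the input. This is a minimal sketch, assuming each file contains one Filename, one Total Sequences and one Sequence length line, and that the filename contains no whitespace; summarise_fastqc and emit_row are hypothetical names chosen for illustration:

```shell
# summarise_fastqc is a hypothetical helper; pass it one or more fastqc_data.txt paths.
summarise_fastqc() {
  awk '
    # emit_row (hypothetical name) prints the values collected so far, then resets them.
    function emit_row() {
      if (seen) print fn, ts, sl
      fn = ts = sl = ""
      seen = 0
    }
    BEGIN { print "Filename Total.Sequences Sequence.length" }
    # FNR is 1 on the first line of every input file: flush the previous file first.
    FNR == 1           { emit_row() }
    /^Filename/        { fn = $2; seen = 1 }
    /^Total Sequences/ { ts = $3; seen = 1 }
    /^Sequence length/ { sl = $3; seen = 1 }
    # The last file has no following file to trigger the flush, so do it here.
    END                { emit_row() }
  ' "$@"
}
```

Because the row is printed at a file boundary rather than on the Sequence length line, the three lines may appear in any order within a file. A field that is missing from a file simply comes out empty in that row.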
Depending on the details of your directory structure and assuming that your shell supports it, you may be able to pass all of the files to the script like awk '...' **/fastqc_data.txt. This uses the "globstar" shell feature to recursively match all files with the name fastqc_data.txt and pass them all to the awk script.
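In bash, globstar is off by default and has to be enabled with shopt -s globstar (bash 4 or later) before the ** pattern recurses. Where globstar is not available, find can do the same job. The following is only a sketch under that assumption; collect_fastqc is a hypothetical helper name, and its argument is whatever top-level directory holds the subdirectories:

```shell
# collect_fastqc is a hypothetical helper; $1 is the directory to search (default: current).
collect_fastqc() {
  # "-exec ... {} +" hands as many matching paths as possible to a single awk call.
  find "${1:-.}" -type f -name fastqc_data.txt -exec awk '
    BEGIN { print "Filename Total.Sequences Sequence.length" }
    /^Filename/        { fn = $2 }
    /^Total Sequences/ { ts = $3 }
    /^Sequence length/ { print fn, ts, $3 }
  ' {} +
}
```

It could be run as collect_fastqc some_directory > summary.txt, where both names are placeholders. Two caveats: find does not guarantee any particular order of the rows, and if the list of paths were too long for one command line, awk would be started more than once and the header would be repeated. With around 200 files the latter is unlikely to happen.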
Upvotes: 1