Reputation: 5759
I am trying to unzip fastq.gz files and then analyze the sequencing data within them. However, later analysis depends on line order being preserved in the unzipped files (line 1 of the zipped file must be line 1 of the unzipped file).
When I manually look at the files, it seems to me that line order is being preserved when using gunzip to unzip the fastq.gz files (and I wouldn't expect anything else). However, downstream analysis fails because order has not been preserved from the original file. Am I missing something about the unzipping process?
It appears that something like the following is happening.
Sequencer writes data to fastq.txt:
line1
line2
line3
line4
Then zips it into fastq.gz. I then unzip using gunzip and appear to get something like the following, where line order is disrupted:
line2
line1
line4
line3
Upvotes: 0
Views: 251
Reputation: 86333
A gzip/gunzip cycle should not (and we can be reasonably confident that it does not) modify the contents of a file. Moreover, data corruption and algorithmic bugs in this case normally show up as a whole bunch of garbage, not as neatly reordered text lines.
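If you want to convince yourself of this, a quick in-memory round trip is enough. The following is only a minimal sketch using Python's standard gzip module; the function name roundtrip_preserves_lines and the sample lines are made up for illustration and are not part of any sequencing toolchain.

```python
import gzip

def roundtrip_preserves_lines(lines):
    """Compress the lines in memory, decompress them, and compare the order."""
    original = "".join(line + "\n" for line in lines).encode("ascii")
    compressed = gzip.compress(original)
    restored = gzip.decompress(compressed)
    # splitlines() yields the decompressed lines in the order they were read
    return restored.decode("ascii").splitlines() == list(lines)

# Hypothetical sample data mirroring the example in the question
print(roundtrip_preserves_lines(["line1", "line2", "line3", "line4"]))
```

The same idea applies to real files: comparing a checksum of the original with a checksum of the decompressed output would show whether anything changed during compression.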
A few alternatives:
Your sequencer does not actually output those lines properly ordered in the first place.
If multiple uncompressed files are involved, it may be that your sequencer does the equivalent of gzip -c file* > fastq.gz, with the input files being named file1 file2 ... file9 file10. When file* is expanded in alphabetic order for such files, file10 will be processed before file2, thus messing up the order in the output.
If multiple compressed files are involved then the same mistake may be happening when decompressing.
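To illustrate the ordering problem described above, here is a small sketch in Python. It assumes file names like those in the example (file1, file2, ..., file10); the helper natural_key is a hypothetical name, and this is just one common way to get numeric-aware ordering, not necessarily what your pipeline should use.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so 'file10' sorts after 'file2'."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]

# Hypothetical file names, as in the example above
names = ["file1", "file2", "file9", "file10"]

# Plain string sorting, similar to how a shell expands file*
print(sorted(names))

# Numeric-aware sorting keeps file10 after file9
print(sorted(names, key=natural_key))
```

If the files are concatenated or decompressed in the first order rather than the second, the lines of file10 end up ahead of those of file2 even though every individual file is intact.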
Upvotes: 1