Reputation: 113
I need to process two large files (> 1 billion lines) and split each file into small files, based on the information in specific lines of one file. The files record high-throughput sequencing data in blocks (called sequencing reads), where each read spans 4 lines (name, sequence, +, quality). The read records are in the same order in the two files.
To do
Split file1.fq based on the id field in file2.fq.
The two files look like this:
$ head -n 4 file1.fq
@name1_1
ACTGAAGCGCTACGTCAT
+
A#AAFJJJJJJJJFJFFF
$ head -n 4 file2.fq
@name1_2
TCTCCACCAACAACAGTG
+
FJJFJJJJJJJJJJJAJJ
I wrote the following python function to do this job:
def p7_bc_demx_pe(fn1, fn2, id_dict):
    """Demultiplex PE reads, by p7 index and barcode"""
    # prepare a pair of writers for each small output file
    fn_writer = {}
    for i in id_dict:
        fn_writer[i] = [open(id_dict[i] + '.1.fq', 'wt'),
                        open(id_dict[i] + '.2.fq', 'wt')]
    # go through the 4-line records of the two files in lockstep
    with open(fn1, 'rt') as f1, open(fn2, 'rt') as f2:
        while True:
            try:
                s1 = [next(f1), next(f1), next(f1), next(f1)]
                s2 = [next(f2), next(f2), next(f2), next(f2)]
                tag = func(s2)  # a function to classify the record
                fn_writer[tag][0].write(''.join(s1))
                fn_writer[tag][1].write(''.join(s2))
            except StopIteration:
                break
    # close writers
    for tag in fn_writer:
        fn_writer[tag][0].close()
        fn_writer[tag][1].close()
Question
Is there any way to speed up this process? (The function above is much too slow.)
How about splitting the large files into chunks aligned to specific lines (e.g. with f.seek()), and running the process in parallel on multiple cores?
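A rough sketch of the chunked, multi-core idea I have in mind (untested; the chunk size, the offset helper and the per-chunk output naming are all my own assumptions, and it reuses func and id_dict from the function above, which would need to be defined at module level for the workers):

from multiprocessing import Pool

READS_PER_CHUNK = 1000000  # hypothetical chunk size: 1M reads = 4M lines

def chunk_offsets(path, reads_per_chunk=READS_PER_CHUNK):
    """Byte offsets of chunk starts, aligned to 4-line record boundaries."""
    offsets = [0]
    pos = 0
    with open(path, 'rb') as f:
        for n, line in enumerate(f, 1):
            pos += len(line)
            if n % (4 * reads_per_chunk) == 0:
                offsets.append(pos)
    if offsets and offsets[-1] == pos:  # drop an empty trailing chunk
        offsets.pop()
    return offsets

def demx_chunk(args):
    """Run the same record loop on one chunk, writing per-chunk outputs."""
    fn1, fn2, off1, off2, part = args
    writers = {i: [open('{}.{}.1.fq'.format(id_dict[i], part), 'wb'),
                   open('{}.{}.2.fq'.format(id_dict[i], part), 'wb')]
               for i in id_dict}
    # binary mode so the byte offsets from chunk_offsets() work with seek()
    with open(fn1, 'rb') as f1, open(fn2, 'rb') as f2:
        f1.seek(off1)
        f2.seek(off2)
        for _ in range(READS_PER_CHUNK):
            try:
                s1 = [next(f1), next(f1), next(f1), next(f1)]
                s2 = [next(f2), next(f2), next(f2), next(f2)]
            except StopIteration:
                break  # the last chunk may hold fewer reads
            tag = func([l.decode() for l in s2])  # classifier from above
            writers[tag][0].write(b''.join(s1))
            writers[tag][1].write(b''.join(s2))
    for w1, w2 in writers.values():
        w1.close()
        w2.close()

if __name__ == '__main__':
    off1 = chunk_offsets('file1.fq', READS_PER_CHUNK)
    off2 = chunk_offsets('file2.fq', READS_PER_CHUNK)
    tasks = [('file1.fq', 'file2.fq', o1, o2, part)
             for part, (o1, o2) in enumerate(zip(off1, off2))]
    with Pool(8) as pool:
        pool.map(demx_chunk, tasks)
    # the per-tag chunk outputs still have to be merged afterwards

The offset scan still reads each file once, but it is sequential and does no writing, so it should be much cheaper than the full pass.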
EDIT-1
There are 500 million reads in each file (~180 GB in size). The bottleneck is reading and writing files. The following is my current solution (it works, but it is definitely not the best):
I first split the big files into smaller files using the shell command split -l (takes ~3 hours).
Then, I apply the function to 8 pairs of small files in parallel (takes ~1 hour).
Finally, I merge the results (takes ~2 hours).
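For reference, the merge step can be as simple as a buffered per-sample concatenation; the glob pattern and file names below are only placeholders:

import glob
import shutil

def merge_parts(pattern, out_path):
    """Concatenate per-chunk outputs into one file, 16 MB at a time."""
    with open(out_path, 'wb') as out:
        for part in sorted(glob.glob(pattern)):
            with open(part, 'rb') as src:
                shutil.copyfileobj(src, out, length=16 * 1024 * 1024)

# e.g. merge_parts('sampleA.*.1.fq', 'sampleA.1.fq')  # placeholder names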
I have not tried PySpark yet. Thanks @John H.
Upvotes: 2
Views: 355
Reputation: 1508
Look into Spark. You can spread your file across a cluster for much faster processing. There is a Python API: pyspark.
https://spark.apache.org/docs/0.9.0/python-programming-guide.html
This also gives you the advantage of actually executing Java code under the hood, which doesn't suffer from the GIL, allowing for true multi-threading.
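For example, the 4-line records in each file can be keyed by a global line index and then joined, roughly like this (a rough, untested sketch; zipWithIndex is not in the 0.9.0 API linked above but was added in later releases, and classifying and writing the per-tag outputs, e.g. with filter and saveAsTextFile, would still have to follow):

from pyspark import SparkContext

sc = SparkContext("local[8]", "fastq-demux")

def as_reads(path):
    """Group every 4 lines into one read, keyed by its global read index."""
    lines = sc.textFile(path)
    return (lines.zipWithIndex()
                 .map(lambda x: (x[1] // 4, (x[1] % 4, x[0])))
                 .groupByKey()
                 .mapValues(lambda v: [line for _, line in sorted(v)]))

reads1 = as_reads('file1.fq')
reads2 = as_reads('file2.fq')
pairs = reads1.join(reads2)  # (read_index, (read1_lines, read2_lines))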
Upvotes: 1