wm3

Reputation: 113

Speed up reading large files in parallel using Python

I need to process two large files (> 1 billion lines each) and split each file into small files based on the information in specific lines of one of them. The files record high-throughput sequencing data in blocks (so-called sequencing reads), where each read spans 4 lines (name, sequence, n, quality). The read records appear in the same order in both files.

To do

Split file1.fq into small files based on the id field in file2.fq.

The two files look like this:

$ head -n 4 file1.fq
@name1_1
ACTGAAGCGCTACGTCAT
+
A#AAFJJJJJJJJFJFFF

$ head -n 4 file2.fq
@name1_2
TCTCCACCAACAACAGTG
+
FJJFJJJJJJJJJJJAJJ

I wrote the following Python function to do this job:

def p7_bc_demx_pe(fn1, fn2, id_dict):
    """Demultiplex PE reads, by p7 index and barcode"""
    # prepare writers for each small files
    fn_writer = {}
    for i in id_dict:
        fn_writer[i] = [open(id_dict[i] + '.1.fq', 'wt'),
            open(id_dict[i] + '.2.fq', 'wt')]

    # go through each record in two files
    with open(fn1, 'rt') as f1, open(fn2, 'rt') as f2:
        while True:
            try:
                s1 = [next(f1), next(f1), next(f1), next(f1)]
                s2 = [next(f2), next(f2), next(f2), next(f2)]
                tag = func(s2) # a function to classify the record
                fn_writer[tag][0].write(''.join(s1))
                fn_writer[tag][1].write(''.join(s2))
            except StopIteration:
                break
    # close writers
    for tag in fn_writer:
        fn_writer[tag][0].close()
        fn_writer[tag][1].close()
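
Note: func above is just a placeholder. For concreteness, a hypothetical version (assuming the barcode is the first 6 bases of the read-2 sequence and that id_dict is keyed by barcode; adapt this to your actual library layout) could look like this:

def classify_by_barcode(s2, id_dict, bc_len=6):
    """Hypothetical stand-in for func(): return the id_dict key matching the
    first bc_len bases of the read-2 sequence, or None if unrecognised
    (the caller can skip such reads or route them to an 'undetermined' file)."""
    bc = s2[1][:bc_len]          # s2[1] is the sequence line
    return bc if bc in id_dict else None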

Question

Is there any way to speed up this process? (The function above is much too slow.)

How about splitting the large files into chunks at specific line boundaries (e.g., using f.seek()) and running the process in parallel on multiple cores?
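
Something like the following rough sketch is what I have in mind (assumptions: func is the classifier above; the two files have to stay in lockstep, so a single reader hands batches to a worker pool rather than workers seeking to independent byte offsets, because an offset that lands on record k in file1.fq rarely lands on record k in file2.fq; only the classification runs in the workers, so this mainly helps when func, not disk I/O, dominates):

from itertools import islice
from multiprocessing import Pool

BATCH = 100000  # records per batch; tune to available memory

def read_batch(f1, f2, n=BATCH):
    """Read up to n paired 4-line records from both files."""
    return list(islice(f1, 4 * n)), list(islice(f2, 4 * n))

def classify_batch(pair):
    """Classify every record of a batch in a worker process."""
    r1, r2 = pair
    tags = [func(r2[i:i + 4]) for i in range(0, len(r2), 4)]
    return r1, r2, tags

def p7_bc_demx_pe_parallel(fn1, fn2, id_dict, procs=8):
    # on platforms that use the 'spawn' start method, call this function
    # from under an `if __name__ == '__main__':` guard
    writers = {i: [open(id_dict[i] + '.1.fq', 'wt'),
                   open(id_dict[i] + '.2.fq', 'wt')] for i in id_dict}
    with open(fn1, 'rt') as f1, open(fn2, 'rt') as f2, Pool(procs) as pool:
        batches = iter(lambda: read_batch(f1, f2), ([], []))
        # imap preserves batch order, so reads stay paired with their tags
        for r1, r2, tags in pool.imap(classify_batch, batches):
            for k, tag in enumerate(tags):
                writers[tag][0].write(''.join(r1[4 * k:4 * k + 4]))
                writers[tag][1].write(''.join(r2[4 * k:4 * k + 4]))
    for w1, w2 in writers.values():
        w1.close()
        w2.close()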

EDIT-1

There are 500 million reads in each file (~180 GB in size). The bottleneck is reading and writing files. The following is my current solution (it works, but is definitely not the best):

I first split the big file into smaller files using the shell command split -l (takes ~3 hours).

Then, apply the function to 8 small files in parallel (takes ~1 hour); see the sketch after these steps.

Finally, merge the results (takes ~2 hours).
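
For reference, a rough sketch of steps 2 and 3 (assumptions: both files were split with the same split -l count, a multiple of 4 so no record is cut in half, e.g. split -l 4000000 file1.fq file1.fq. so the chunks are named file1.fq.aa, file1.fq.ab, ... and likewise for file2.fq; each worker writes to its own per-chunk output prefixes so no two workers touch the same file):

import glob
import shutil
from multiprocessing import Pool

def run_chunk(args):
    """Demultiplex one pair of chunks into per-chunk output files."""
    fn1, fn2, suffix, id_dict = args
    chunk_dict = {k: v + '.' + suffix for k, v in id_dict.items()}
    p7_bc_demx_pe(fn1, fn2, chunk_dict)
    return suffix

def merge(prefix, suffixes, mate):
    """Concatenate the per-chunk outputs into prefix.<mate>.fq."""
    with open('{}.{}.fq'.format(prefix, mate), 'wb') as out:
        for s in suffixes:
            with open('{}.{}.{}.fq'.format(prefix, s, mate), 'rb') as part:
                shutil.copyfileobj(part, out)

def run_all(id_dict, procs=8):
    chunks1 = sorted(glob.glob('file1.fq.*'))
    chunks2 = sorted(glob.glob('file2.fq.*'))
    suffixes = [c.rsplit('.', 1)[-1] for c in chunks1]
    jobs = list(zip(chunks1, chunks2, suffixes, [id_dict] * len(chunks1)))
    with Pool(procs) as pool:
        pool.map(run_chunk, jobs)       # step 2: process chunk pairs in parallel
    for prefix in id_dict.values():     # step 3: merge per-chunk outputs
        for mate in (1, 2):
            merge(prefix, suffixes, mate)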

I have not tried PySpark yet; thanks @John H.

Upvotes: 2

Views: 355

Answers (1)

John R

Reputation: 1508

Look into Spark. You can spread your files across a cluster for much faster processing. There is a Python API: pyspark.

https://spark.apache.org/docs/0.9.0/python-programming-guide.html

This also gives you the advantage of actually executing Java code, which doesn't suffer from the GIL, allowing for true multi-threading.
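
A very rough RDD sketch of the idea (assumptions: a Spark version that has RDD.zipWithIndex; classify and known_tags stand in for the question's func and the keys of id_dict; saveAsTextFile writes a directory of part-files per tag rather than a single .fq):

from pyspark import SparkContext

def records(sc, path):
    """RDD of (record_index, [name, seq, plus, qual])."""
    return (sc.textFile(path)
              .zipWithIndex()                                # (line, line_no)
              .map(lambda x: (x[1] // 4, (x[1] % 4, x[0])))  # key by record
              .groupByKey()
              .mapValues(lambda v: [line for _, line in sorted(v)]))

sc = SparkContext(appName="fastq-demux")
# join on record index so the two mates stay paired
pairs = records(sc, "file1.fq").join(records(sc, "file2.fq"))   # (idx, (r1, r2))
tagged = pairs.map(lambda kv: (classify(kv[1][1]), kv[1])).cache()  # (tag, (r1, r2))
for tag in known_tags:
    subset = tagged.filter(lambda kv: kv[0] == tag)
    subset.map(lambda kv: '\n'.join(kv[1][0])).saveAsTextFile(tag + '.1.out')
    subset.map(lambda kv: '\n'.join(kv[1][1])).saveAsTextFile(tag + '.2.out')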

Upvotes: 1
