merv

Reputation: 76950

How to work around open files limit when demuxing files?

I frequently have large text files (10-100GB decompressed) to demultiplex based on barcodes in each line, where in practice the number of resulting individual files (unique barcodes) is between 1K and 20K. I've been using awk for this and it accomplishes the task. However, I've noticed that the rate of demuxing larger files (which correlates with more unique barcodes used) is significantly slower (10-20X). Checking ulimit -n shows 4096 as the limit on open files per process, so I suspect that the slowdown is due to the overhead of awk being forced to constantly close and reopen files whenever the total number of demuxed files exceeds 4096.

Lacking root access (i.e., the limit is fixed), what kinds of workarounds could be used to circumvent this bottleneck?

I do have a list of all barcodes present in each file, so I've considered forking multiple awk processes where each is assigned a mutually exclusive subset (< 4096) of barcodes to search for. However, I'm concerned the overhead of having to check each line's barcode for set membership might defeat the gains of not closing files.
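For reference, the following (untested) sketch is roughly what I had in mind; barcodes.txt (one barcode per line), input.txt, the chunk size of 4000, and the assumption that the barcode sits in field 1 are all placeholders for the real data described below:

# split the barcode list into chunks that stay under the 4096 open-file limit
split -l 4000 barcodes.txt chunk_

# one awk per chunk, each handling only the barcodes in its own membership set
for c in chunk_*; do
    awk -v bcfile="$c" '
        BEGIN { while ((getline bc < bcfile) > 0) keep[bc] = 1; close(bcfile) }
        $1 in keep { print > ("demux." $1) }
    ' input.txt &
done
wait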

Is there a better strategy?

I'm not married to awk, so approaches in other scripting or compiled languages are welcome.


Specific Example

Data Generation (FASTQ with barcodes)

The following generates data similar to what I'm specifically working with. Each entry consists of 4 lines, where the barcode is an 18-character word over the unambiguous DNA alphabet (ACGT).

1024 unique barcodes | 1 million reads

cat /dev/urandom | tr -dc "ACGT" | fold -w 5 | \
awk '{ print "@batch."NR"_"$0"AAAAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.1K.fastq

16384 unique barcodes | 1 million reads

cat /dev/urandom | tr -dc "ACGT" | fold -w 7 | \
awk '{ print "@batch."NR"_"$0"AAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.16K.fastq

awk script for demultiplexing

Note that in this case I'm writing to 2 files for each unique barcode.

demux.awk

#!/usr/bin/awk -f
BEGIN {
    if (length(outdir) == 0 || length(prefix) == 0) {
        print "Variables 'outdir' and 'prefix' must be defined!" > "/dev/stderr";
        exit 1;
    }
    print "[INFO] Initiating demuxing..." > "/dev/stderr";
}
{
    # Header lines (every 4th line, starting at line 1) carry the barcode
    # (bx[1]) and UMI (bx[2]); the UMI goes to the per-barcode .umi file.
    if (NR%4 == 1) {
        match($1, /.*_([ACGT]{18})_([ACGTN]{6}).*/, bx);
        print bx[2] >> outdir"/"prefix"."bx[1]".umi";
    }
    # Every line of the entry goes to the per-barcode .fastq file.
    print >> outdir"/"prefix"."bx[1]".fastq";

    if (NR%40000 == 0) {
        printf("[INFO] %d reads processed\n", NR/4) > "/dev/stderr";
    }
}
END {
    printf("[INFO] %d total reads processed\n", NR/4) > "/dev/stderr";
}

Usage

awk -v outdir="/tmp/demux1K" -v prefix="batch" -f demux.awk cells.1K.fastq

and similarly for cells.16K.fastq.

Assuming you're the only one running awk, you can verify the approximate number of open files using

lsof | grep "awk" | wc -l
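If other awk processes might be running, a slightly more targeted count (assuming pgrep is available and the process is actually named awk rather than, say, gawk) would be:

lsof -p "$(pgrep -x awk | head -n 1)" | wc -l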

Observed Behavior

Despite the files being the same size, the one with 16K unique barcodes runs 10X-20X slower than the one with only 1K unique barcodes.

Upvotes: 3

Views: 177

Answers (1)

Ed Morton

Reputation: 203985

Without seeing any sample input/output or the script you're currently executing, it's very much guesswork, but if you currently have the barcode in field 1 and are doing this (assuming GNU awk, so you don't have your own code managing the open files):

awk '{print > $1}' file

then if managing open files really is your problem you'll get a significant improvement if you change it to:

sort file | awk '$1!=f{close(f); f=$1} {print > f}'

The above is, of course, making assumptions about what these barcode values are, which field holds them, what separates fields, whether or not the output order has to match the original, what else your code might be doing that gets slower as the input grows, etc., since you haven't shown us any of that yet.

If that's not all you need then edit your question to include the missing MCVE.


Given your updated question with your script and the info that the input is 4-line blocks, I'd approach the problem by adding the key "bx" values at the front of each record, using NUL to separate the 4-line blocks, and then using NUL as the record separator for sort and the subsequent awk:

$ cat tst.sh
infile="$1"
outdir="${infile}_out"
prefix="foo"

mkdir -p "$outdir" || exit 1

awk -F'[_[:space:]]' -v OFS='\t' -v ORS= '
    NR%4 == 1 { print $2 OFS $3 OFS }      # prefix each block with its barcode and UMI
    { print $0 (NR%4 ? RS : "\0") }        # keep \n inside the block, end it with NUL
' "$infile" |
sort -z |
awk -v RS='\0' -F'\t' -v outdir="$outdir" -v prefix="$prefix" '
BEGIN {
    if ( (outdir == "") || (prefix == "") ) {
        print "Variables \047outdir\047 and \047prefix\047 must be defined!" | "cat>&2"
        exit 1
    }
    print "[INFO] Initiating demuxing..." | "cat>&2"
    outBase = outdir "/" prefix "."
}
{
    bx1   = $1
    bx2   = $2
    fastq = $3

    if ( bx1 != prevBx1 ) {
        close(umiOut)
        close(fastqOut)
        umiOut   = outBase bx1 ".umi"
        fastqOut = outBase bx1 ".fastq"
        prevBx1  = bx1
    }

    print bx2   > umiOut
    print fastq > fastqOut

    if (NR%10000 == 0) {
        printf "[INFO] %d reads processed\n", NR | "cat>&2"
    }
}
END {
    printf "[INFO] %d total reads processed\n", NR | "cat>&2"
}
'

When run against input files generated as you describe in your question:

$ wc -l cells.*.fastq
4000000 cells.16K.fastq
4000000 cells.1K.fastq

the results are:

$ time ./tst.sh cells.1K.fastq 2>/dev/null

real    0m55.333s
user    0m56.750s
sys     0m1.277s

$ ls cells.1K.fastq_out | wc -l
2048

$ wc -l cells.1K.fastq_out/*.umi | tail -1
1000000 total

$ wc -l cells.1K.fastq_out/*.fastq | tail -1
4000000 total


$ time ./tst.sh cells.16K.fastq 2>/dev/null

real    1m6.815s
user    0m59.058s
sys     0m5.833s

$ ls cells.16K.fastq_out | wc -l
32768

$ wc -l cells.16K.fastq_out/*.umi | tail -1
1000000 total

$ wc -l cells.16K.fastq_out/*.fastq | tail -1
4000000 total

Upvotes: 4
