Reputation: 76950
I frequently have large text files (10-100GB decompressed) to demultiplex based on barcodes in each line, where in practice the number of resulting individual files (unique barcodes) is between 1K and 20K. I've been using awk
for this and it accomplishes the task. However, I've noticed that demuxing larger files (which in practice means more unique barcodes) is significantly (10-20X) slower. Checking ulimit -n
shows 4096 as the limit on open files per process, so I suspect that the slowdown is due to the overhead of awk
being forced to constantly close and reopen files whenever the total number of demuxed files exceeds 4096.
Lacking root access (i.e., the limit is fixed), what kinds of workarounds could be used to circumvent this bottleneck?
I do have a list of all barcodes present in each file, so I've considered forking multiple awk
processes where each is assigned a mutually exclusive subset (< 4096) of barcodes to search for. However, I'm concerned the overhead of having to check each line's barcode for set membership might defeat the gains of not closing files.
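For reference, the kind of split I had in mind would look roughly like this (only a sketch: barcodes.txt, input.fastq, the chunk size, and the output paths are placeholders, and it assumes GNU awk for the three-argument match()):
mkdir -p /tmp/demux

# split the list of known barcodes into chunks below the open-files limit
split -l 4000 barcodes.txt chunk.

# one pass over the input per chunk; each awk only opens files for its own barcode subset
for chunk in chunk.*; do
    awk -v bcfile="$chunk" -v outdir="/tmp/demux" -v prefix="batch" '
        BEGIN { while ((getline bc < bcfile) > 0) keep[bc] = 1 }
        NR%4 == 1 {
            match($1, /.*_([ACGT]{18})_([ACGTN]{6}).*/, bx)
            emit = (bx[1] in keep)                  # per-read set membership test
            if (emit) print bx[2] >> outdir"/"prefix"."bx[1]".umi"
        }
        emit { print >> outdir"/"prefix"."bx[1]".fastq" }
    ' input.fastq &
done
wait
The obvious costs are that every chunk re-scans the full input and that every read pays the membership lookup mentioned above.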
Is there a better strategy?
I'm not married to awk
, so approaches in other scripting or compiled languages are welcome.
The following generates data similar to what I'm specifically working with. Each entry consists of 4 lines, where the barcode is an 18-character word over the unambiguous DNA alphabet.
1024 unique barcodes | 1 million reads
cat /dev/urandom | tr -dc "ACGT" | fold -w 5 | \
awk '{ print "@batch."NR"_"$0"AAAAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.1K.fastq
16384 unique barcodes | 1 million reads
cat /dev/urandom | tr -dc "ACGT" | fold -w 7 | \
awk '{ print "@batch."NR"_"$0"AAAAAAAAAAA_ACGTAC length=1\nA\n+\nI" }' | \
head -n 4000000 > cells.16K.fastq
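To sanity-check how many distinct barcodes a generated file actually contains (GNU awk for length() on an array; the barcode is the second underscore-delimited field of each header line), something like this should report roughly 1024 and 16384 for the two files respectively:
awk -F'_' 'NR%4 == 1 { bc[$2]++ } END { print length(bc) }' cells.1K.fastq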
awk script for demultiplexing. Note that in this case I'm writing to 2 files for each unique barcode.
demux.awk
#!/usr/bin/awk -f
BEGIN {
    if (length(outdir) == 0 || length(prefix) == 0) {
        print "Variables 'outdir' and 'prefix' must be defined!" > "/dev/stderr";
        exit 1;
    }
    print "[INFO] Initiating demuxing..." > "/dev/stderr";
}
{
    if (NR%4 == 1) {
        # header line: bx[1] = 18-char barcode, bx[2] = 6-char UMI
        # (3-argument match() requires GNU awk)
        match($1, /.*_([ACGT]{18})_([ACGTN]{6}).*/, bx);
        print bx[2] >> outdir"/"prefix"."bx[1]".umi";
    }
    # every line of the 4-line record goes to the current barcode's fastq
    print >> outdir"/"prefix"."bx[1]".fastq";
    if (NR%40000 == 0) {
        printf("[INFO] %d reads processed\n", NR/4) > "/dev/stderr";
    }
}
END {
    printf("[INFO] %d total reads processed\n", NR/4) > "/dev/stderr";
}
awk -v outdir="/tmp/demux1K" -v prefix="batch" -f demux.awk cells.1K.fastq
or similarly for cells.16K.fastq.
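For example (the output directory name is just an illustration):
awk -v outdir="/tmp/demux16K" -v prefix="batch" -f demux.awk cells.16K.fastq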
Assuming you're the only one running awk
, you can verify the approximate number of open files using
lsof | grep "awk" | wc -l
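On Linux you could instead count that process's file descriptors directly (a rough alternative; assumes a mounted /proc and that pgrep -n picks up the right awk process):
ls /proc/"$(pgrep -n awk)"/fd | wc -l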
Despite the files being the same size, the one with 16K unique barcodes runs 10X-20X slower than the one with only 1K unique barcodes.
Upvotes: 3
Views: 177
Reputation: 203985
Without seeing any sample input/output or the script you're currently executing, it's very much guesswork, but if you currently have the barcode in field 1 and are doing something like this (assuming GNU awk, so you don't have your own code managing the open files):
awk '{print > $1}' file
then, if managing open files really is your problem, you'll get a significant improvement by changing it to:
sort file | awk '$1!=f{close(f); f=$1} {print > f}'
The above is, of course, making assumptions about what these barcode values are, which field holds them, what separates fields, whether or not the output order has to match the original, what else your code might be doing that gets slower as the input grows, etc., etc., since you haven't shown us any of that yet.
If that's not all you need then edit your question to include the missing MCVE.
Given your updated question with your script and the info that the input is 4-line blocks, I'd approach the problem by adding the key "bx" values at the front of each record, using NUL to separate the 4-line blocks, and then using NUL as the record separator for sort and the subsequent awk:
$ cat tst.sh
infile="$1"
outdir="${infile}_out"
prefix="foo"

mkdir -p "$outdir" || exit 1

awk -F'[_[:space:]]' -v OFS='\t' -v ORS= '
    NR%4 == 1 { print $2 OFS $3 OFS }
    { print $0 (NR%4 ? RS : "\0") }
' "$infile" |
sort -z |
awk -v RS='\0' -F'\t' -v outdir="$outdir" -v prefix="$prefix" '
    BEGIN {
        if ( (outdir == "") || (prefix == "") ) {
            print "Variables \047outdir\047 and \047prefix\047 must be defined!" | "cat>&2"
            exit 1
        }
        print "[INFO] Initiating demuxing..." | "cat>&2"
        outBase = outdir "/" prefix "."
    }
    {
        bx1   = $1
        bx2   = $2
        fastq = $3
        if ( bx1 != prevBx1 ) {
            close(umiOut)
            close(fastqOut)
            umiOut   = outBase bx1 ".umi"
            fastqOut = outBase bx1 ".fastq"
            prevBx1  = bx1
        }
        print bx2   > umiOut
        print fastq > fastqOut
        if (NR%10000 == 0) {
            printf "[INFO] %d reads processed\n", NR | "cat>&2"
        }
    }
    END {
        printf "[INFO] %d total reads processed\n", NR | "cat>&2"
    }
'
When run against input files generated as you describe in your question:
$ wc -l cells.*.fastq
4000000 cells.16K.fastq
4000000 cells.1K.fastq
the results are:
$ time ./tst.sh cells.1K.fastq 2>/dev/null
real 0m55.333s
user 0m56.750s
sys 0m1.277s
$ ls cells.1K.fastq_out | wc -l
2048
$ wc -l cells.1K.fastq_out/*.umi | tail -1
1000000 total
$ wc -l cells.1K.fastq_out/*.fastq | tail -1
4000000 total
$ time ./tst.sh cells.16K.fastq 2>/dev/null
real 1m6.815s
user 0m59.058s
sys 0m5.833s
$ ls cells.16K.fastq_out | wc -l
32768
$ wc -l cells.16K.fastq_out/*.umi | tail -1
1000000 total
$ wc -l cells.16K.fastq_out/*.fastq | tail -1
4000000 total
Upvotes: 4