Black No13

Reputation: 41

samtools calmd is pretty slow

I am using "samtools calmd" to add the MD tag back to a BAM file. The original BAM is around 50 GB (whole-genome sequencing with PacBio HiFi reads). The problem is that calmd is incredibly slow: the job has already run for 12 hours and has produced only 600 MB of BAM with MD tags. At this rate, the 50 GB BAM would take roughly 40 days to finish!

Here is the Snakemake rule I use to add the MD tag (nothing unusual):

rule addMDTag:
    input:
        rules.pbmm2_alignment.output        
    output: 
        strBAMDir + "/pbmm2/v37/{wcReadsType}/Tmp/rawReads{readsIndex}.MD.bam"               
    params:
        ref = strRef
    threads:
        16
    log:
        strBAMDir + "/pbmm2/v37/{wcReadsType}/Log/rawReads{readsIndex}.MD.log"
    benchmark:
        strBAMDir + "/pbmm2/v37/{wcReadsType}/Benchmark/rawReads{readsIndex}.MD.benchmark.txt"
    shell:
        "samtools calmd -bAr -@ {threads} {input} {params.ref} > {output}"

The samtools version is v1.10.

BTW, I give calmd 16 threads, but samtools still appears to use only one core:

top - 11:44:53 up 47 days, 20:35,  1 user,  load average: 2.00, 2.01, 2.00
Tasks: 1723 total,   3 running, 1720 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.8%us,  0.3%sy,  0.0%ni, 96.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  529329180k total, 232414724k used, 296914456k free,    84016k buffers
Swap: 12582908k total,    74884k used, 12508024k free, 227912476k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                       
 93137 lix33     20   0  954m 151m 2180 R 100.2  0.0 659:04.13 samtools 

How can I make calmd much faster? Or is there another tool that does the same job more efficiently?

Thanks so much

Upvotes: 2

Views: 177

Answers (1)

Black No13

Reputation: 41

After working with the samtools maintainers, this issue has been solved. calmd is extremely slow if the BAM is unsorted. Therefore, always make sure the BAM is position-sorted before running calmd.

See the details below:

Are your files name-sorted, and does your reference have more than one entry? If so, calmd will be switching between references all the time, which means it may be doing a lot of reference loading and not much MD calculation.

You may find it goes a lot faster if you position-sort the input, and then run it through calmd.
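
The advice above could be sketched in shell roughly as follows. This is a minimal sketch under stated assumptions, not the maintainers' exact commands: the helper names `sort_order` and `calmd_sorted`, the file names in the usage comments, and the default thread count are all hypothetical. It first reads the sort order recorded in the `@HD` header line, then position-sorts the input and pipes it straight into calmd.

```shell
#!/bin/sh
# Print the SO: (sort order) value from a SAM header read on stdin,
# or "unknown" if there is no @HD line carrying an SO field.
# Usage (hypothetical file name): samtools view -H in.bam | sort_order
sort_order() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1; exit }
        }
        END { if (!found) print "unknown" }
    '
}

# Position-sort a BAM, then add MD tags with calmd.
# Usage (hypothetical file names): calmd_sorted in.bam ref.fa out.MD.bam 16
calmd_sorted() {
    in_bam=$1; ref=$2; out_bam=$3; threads=${4:-4}
    # -l 0 keeps the intermediate stream uncompressed; sort writes to
    # stdout when -o is not given, and calmd reads stdin via "-".
    # The -A and -r flags from the question can be added next to -b if wanted.
    samtools sort -@ "$threads" -l 0 "$in_bam" \
        | samtools calmd -b -@ "$threads" - "$ref" > "$out_bam"
}
```

If `sort_order` reports anything other than `coordinate`, sorting first is likely worth the extra step. In a Snakemake workflow the same idea would mean placing a sort rule between the alignment rule and the calmd rule.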

Upvotes: 2
