user29589776

Reputation: 1

Sorting, merging, and deduplicating large txt files on HPC?

I am looking for some advice.

I am building a k-mer database and need to merge, sort, and deduplicate (keep only the unique lines from) 47 sample .txt.gz files of about 16 GB each. What would be the fastest way to do this?

I am currently running this:

zcat *.merged.kmers.txt.gz | sort --parallel=48 --buffer-size=1400G | uniq | gzip > all_unique_kmers.txt.gz
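
For context, here is a variant I am considering after reading the GNU sort manual (just a sketch: it assumes GNU coreutils sort and zstd are available on the cluster, and /scratch/$USER/tmp is a placeholder for a temp directory with a few TB free):

    # Bytewise (C locale) comparison is faster and safe for plain k-mer strings
    export LC_ALL=C
    zcat *.merged.kmers.txt.gz \
        | sort -u --parallel=48 --buffer-size=1400G \
               -T /scratch/$USER/tmp --compress-program=zstd \
        | gzip > all_unique_kmers.txt.gz

sort -u deduplicates during the merge phase itself, so the separate uniq pass disappears, and --compress-program compresses the temporary runs that sort spills to disk. Does this look like a sensible direction?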

I have been running it under Slurm, but I want to know what parameters to use and how someone else would approach this. It has been running for 4 days!
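
My sbatch wrapper looks roughly like this (a sketch, not my exact script: the partition name, memory, and time limit are placeholders for whatever the cluster provides):

    #!/bin/bash
    #SBATCH --job-name=kmer_dedup
    #SBATCH --cpus-per-task=48
    #SBATCH --mem=1500G              # must exceed sort's --buffer-size
    #SBATCH --time=7-00:00:00
    #SBATCH --partition=highmem      # placeholder partition name

    export LC_ALL=C
    zcat *.merged.kmers.txt.gz \
        | sort -u --parallel="$SLURM_CPUS_PER_TASK" --buffer-size=1400G \
               -T /scratch/$USER/tmp \
        | gzip > all_unique_kmers.txt.gz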

47 samples; each is ~16 GB compressed, ~80 GB uncompressed.

merge, sort, deduplicate
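
One alternative I have read about is a two-pass approach: sort and deduplicate each sample on its own, then run a merge-only pass over the pre-sorted runs. A sketch under the same assumptions as above (note that pass 1 writes uncompressed runs, so it needs roughly 4 TB of scratch space):

    export LC_ALL=C
    mkdir -p /scratch/$USER/sorted /scratch/$USER/tmp
    # Pass 1: sort and deduplicate each sample independently
    # (these iterations could also run as separate Slurm jobs)
    for f in *.merged.kmers.txt.gz; do
        zcat "$f" | sort -u --parallel=8 --buffer-size=64G -T /scratch/$USER/tmp \
            > "/scratch/$USER/sorted/${f%.gz}"
    done
    # Pass 2: merge-only (-m) over the already-sorted runs; -u drops duplicates
    # while streaming; --batch-size=64 lets sort merge all 47 runs in one pass
    sort -m -u --batch-size=64 -T /scratch/$USER/tmp /scratch/$USER/sorted/*.txt \
        | gzip > all_unique_kmers.txt.gz

The appeal is that pass 1 parallelises across samples and the merge pass is a single sequential stream rather than one giant sort. Would this be faster in practice?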

Please, someone help me...

Upvotes: 0

Views: 30

Answers (0)
