Alexandre.S

Reputation: 107

Split multiple files into multiple parts with Snakemake

I’m building a pipeline that is supposed to take a list of files as input (located anywhere on the disk), split each of these files into smaller pieces, and then run some computations on all of the pieces before merging the results. I’m struggling with the first step.

For example:
Input files = A and B
A and B are each split into 10 files: A1, A2, A3, A4… B9, B10.
Some computation is run on every subfile: results_A1, results_A2… results_B10
The results are merged according to the input file they came from, so we end up with
results_A_merged and results_B_merged

The tool that splits the files (seqkit split) takes the number of pieces to split a file into, the file to split, and an output directory, and writes the split files to that directory following a fixed pattern. If the input file is path/to/file_A.fasta, it will write output_dir/file_A.part_001.fasta, output_dir/file_A.part_002.fasta, etc.
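
The naming pattern described above can be sketched in plain Python. This is only a minimal illustration: the helper name expected_parts and its out_dir default are hypothetical, and the pattern is taken from the description above rather than from the seqkit documentation.

```python
from pathlib import Path

def expected_parts(fasta_path, n_parts, out_dir="output_dir"):
    """List the part files expected for one input file (hypothetical helper)."""
    p = Path(fasta_path)
    # p.stem is the file name without its last extension; p.suffix is that extension
    return [f"{out_dir}/{p.stem}.part_{i:03d}{p.suffix}" for i in range(1, n_parts + 1)]

parts = expected_parts("path/to/file_A.fasta", 3)
print(parts)
```

Using pathlib here avoids having to list the possible extensions (.fna, .fa, .fasta) by hand.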

I managed to do this with a single input file:

import re

my_files=["path/to/file_1.fasta"]
my_files_dir=[]
my_files_prefix=[]
my_files_extension=[]

###Store the path to the dir, the file name without extension, and the extension.
for i in my_files:
    print(i)
    # One match per path: (directory)/(prefix)(.fna | .fa | .fasta)
    m = re.search(r'(.*)/(.*)(\.(?:fna|fa|fasta))$', i)
    my_files_dir.append(m.group(1))
    my_files_prefix.append(m.group(2))
    my_files_extension.append(m.group(3))

###Create the name of all the splited files
my_temp_fasta=[]    
for i in range(1,blast_jobs):
    my_temp_fasta.append(my_files_prefix[0]+'.part_%03d'%i+my_files_extension[0])

###Split my file.
rule split_fasta:
    input:
        my_files
    output:
        expand('splited_fasta/{tmp_fasta_files}', tmp_fasta_files=my_temp_fasta)
    params:
        num_sequences=10
    shell:
        "seqkit split --out-dir splited_fasta --by-part {params.num_sequences} {input}"

But as soon as I try with multiple files, I cannot even manage to split them correctly.

Here is my non-working pipeline, which for the moment has only one rule that tries to split the files.


import re

my_files=["path/to/file_1.fasta", "other/path/to/file_2.fasta"]

my_files_dir=[]
my_files_prefix=[]
my_files_extension=[]

###Store the path to the dir, the file name without extension, and the extension of each file.
for i in my_files:
    print(i)
    # One match per path: (directory)/(prefix)(.fna | .fa | .fasta)
    m = re.search(r'(.*)/(.*)(\.(?:fna|fa|fasta))$', i)
    my_files_dir.append(m.group(1))
    my_files_prefix.append(m.group(2))
    my_files_extension.append(m.group(3))

#Store all the files that will be created by the split command.
tmp=[]
my_temp_fasta_dict={}
for j in range(0,len(my_files)):
    for i in range(1,10):
        tmp.append(my_files_prefix[j]+'.part_%03d'%i+my_files_extension[j])
    my_temp_fasta_dict[my_files_prefix[j]] = tmp
    tmp=[]

##So I have a (useless...) dictionary, with file name prefix as key, and a list of splited file names as values.

rule split_fasta:
    input:
        my_files
    output:
        expand('splited_fasta/{tmp_fasta_files}', tmp_fasta_files=my_temp_fasta_dict.values())
    params:
        num_sequences=10
    shell:
        "seqkit split --out-dir splited_fasta --by-part {params.num_sequences} {input}"

This produces the wrong command, with all my input files passed at once:

seqkit split --out-dir splited_fasta --by-part 5 path/to/file_1.fasta other/path/to/file_2.fasta

instead of running the command twice, once per input file. I just cannot get that to work, and the worst thing is that it's probably easy...

Thank you in advance for your help.

Upvotes: 0

Views: 653

Answers (1)

Dmitry Kuzminov

Reputation: 6600

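It is a common mistake for Snakemake users to reason bottom-up (from the input files to the target). Try the top-down approach instead: start with the target, then work out what is needed to build that target, and so on. Each rule then describes a single sample through the {sample} wildcard, and Snakemake runs it once per sample: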

# Zero-padded part numbers "001" ... "010" (10 parts, as in the question)
PARTS = [f"{i:03d}" for i in range(1, 11)]

rule all:
    input:
        expand("results_{sample}_merged", sample=["A", "B"])

rule merge:
    input:
        expand("output_dir/file_{{sample}}.part_{n}.fasta", n=PARTS)
    output:
        "results_{sample}_merged"

rule split:
    input:
        "{sample}"
    output:
        expand("output_dir/file_{{sample}}.part_{n}.fasta", n=PARTS)
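
Since the input files can live anywhere on disk, one possible way to connect a {sample} wildcard back to its original path is a lookup table combined with an input function. The sketch below is plain Python and only an illustration: SAMPLES and source_fasta are hypothetical names, and it assumes that file names without their extension are unique across the inputs.

```python
from pathlib import Path
from types import SimpleNamespace

my_files = ["path/to/file_1.fasta", "other/path/to/file_2.fasta"]

# Hypothetical lookup table: sample name (file name without extension) -> full path
SAMPLES = {Path(p).stem: p for p in my_files}

def source_fasta(wildcards):
    """Input function: return the original path for the requested sample."""
    return SAMPLES[wildcards.sample]

# Inside a Snakefile, the split rule could use `input: source_fasta`, and
# rule all could expand over `sample=SAMPLES.keys()`.
# Stand-in for the wildcards object that Snakemake would pass to the function:
print(source_fasta(SimpleNamespace(sample="file_2")))  # other/path/to/file_2.fasta
```

With this, each split job receives exactly one input path, so the shell command is run once per file instead of once with all files.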

Upvotes: 3
