onetwoonexu

Reputation: 11

Snakemake: list of paths in input

Sorry for the basic question; I am a junior developer learning Snakemake along with click. In the example below, how can I pass a list of paths as a rule's input, and then receive that list in a Python script?

Snakemake:

path_1 = 'data/raw/data2process/'
path_2 = 'data/raw/table.xlsx'

rule:
    input:
        list_of_paths = "list of all paths to .xlsx/.csv/.xls files from path_1",
        other_table = path_2
    output:
        {some .xlsx file}
    shell:
        """
        script_1.py {input.list_of_paths} {output}
        script_2.py {input.other_table} {output}
        """

script_1.py:

import click
import pandas as pd

@click.command()
@click.argument("input_list_of_paths", type=*??*)
@click.argument("out_path", type=click.Path())
def foo(input_list_of_paths: list, out_path: str):
    df = pd.DataFrame()
    for path in input_list_of_paths:
        table = pd.read_excel(path)
        # ... do something with table ...
        df = pd.concat([df, table])
    df.to_excel(out_path)

script_2.py:

import click
import pandas as pd

@click.command()
@click.argument("input_path", type=click.Path(exists=True))
@click.argument("output_path", type=click.Path())
def foo_1(input_path: str, output_path: str):
    table = pd.read_excel(input_path)
    # ... do something with table ...
    table.to_excel(output_path)

Upvotes: 1

Views: 110

Answers (1)

bli

Reputation: 8194

Using pathlib, and the glob method of a Path object, you could proceed as follows:

from itertools import chain
from pathlib import Path
path_1 = Path('data/raw/data2process/')
exts = ["xlsx", "csv", "xls"]
path_1_path_lists = [
    list(path_1.glob(f"*.{ext}"))
    for ext in exts]
path_1_all_paths = list(chain.from_iterable(path_1_path_lists))

chain.from_iterable "flattens" the list of lists into a single list, though I'm not sure Snakemake even needs this for the input of its rules.
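To check what the globbing and flattening produce, here is a minimal, self-contained sketch that runs against a throwaway directory. The helper name `collect_paths` and the file names are made up for illustration; only `Path.glob` and `chain.from_iterable` come from the approach above.

```python
from itertools import chain
from pathlib import Path
import tempfile


def collect_paths(directory, exts):
    """Return a sorted, flat list of path strings matching any of the extensions."""
    directory = Path(directory)
    # One glob iterator per extension; chain.from_iterable merges them into one stream.
    per_ext = [directory.glob(f"*.{ext}") for ext in exts]
    return sorted(str(p) for p in chain.from_iterable(per_ext))


# Demo on a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.xlsx", "b.csv", "c.xls", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = [Path(p).name for p in collect_paths(tmp, ["xlsx", "csv", "xls"])]
    print(found)  # ['a.xlsx', 'b.csv', 'c.xls']
```

Note that the `*.xls` pattern does not also pick up `a.xlsx`, so no file is listed twice.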

Then, in your rule:

input:
    list_of_paths = path_1_all_paths,
    other_table = path_2

I think that Path objects can be used directly. Otherwise, you need to turn them into strings with str:

input:
    list_of_paths = [str(p) for p in path_1_all_paths],
    other_table = path_2
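For the second half of the question (receiving the list in the script): when a named input holding several files is used in a `shell` command, Snakemake expands it to the paths separated by spaces, so the script sees them as separate positional arguments. On the click side, one way to collect them is a variadic argument declared with `nargs=-1`. The sketch below assumes that approach; the command name `merge` is hypothetical, and plain text concatenation stands in for the pandas logic so that the example has no dependency beyond click.

```python
import click


@click.command()
# nargs=-1 gathers all leading positional arguments into a tuple;
# required=True makes click reject an empty list.
@click.argument("input_paths", nargs=-1, required=True,
                type=click.Path(exists=True))
# The last positional argument is the output file.
@click.argument("out_path", type=click.Path())
def merge(input_paths, out_path):
    """Concatenate all input files into out_path (stand-in for the pandas code)."""
    with open(out_path, "w") as out:
        for path in input_paths:
            with open(path) as src:
                out.write(src.read())
    click.echo(f"merged {len(input_paths)} file(s) into {out_path}")
```

In a real script you would call `merge()` at the bottom of the file and invoke it from the rule with something like `python script_1.py {input.list_of_paths} {output}`. click allows only one `nargs=-1` argument per command, which is enough here because the output path is a single trailing argument.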

Upvotes: 1
