bli

Reputation: 8184

'InputFiles' object has no attribute <X> when using a function as input for a snakemake rule

I have a snakemake workflow where some rules have a complex function as input:

def source_fold_data(wildcards):
    fold_type = wildcards.fold_type
    if fold_type in {"log2FoldChange", "lfcMLE"}:
        if hasattr(wildcards, "contrast_type"):
            # OPJ is os.path.join
            return expand(
                OPJ(output_dir, aligner, "mapped_C_elegans",
                    "deseq2_%s" % size_selected, "{contrast}",
                    "{contrast}_{{small_type}}_counts_and_res.txt"),
                contrast=contrasts_dict[wildcards.contrast_type])
        else:
            return rules.small_RNA_differential_expression.output.counts_and_res
    elif fold_type == "mean_log2_RPKM_fold":
        if hasattr(wildcards, "contrast_type"):
            # This is the branch used when I have the AttributeError
            #https://stackoverflow.com/a/26791923/1878788
            return [filename.format(wildcards) for filename in expand(
                OPJ(output_dir, aligner, "mapped_C_elegans",
                    "RPKM_folds_%s" % size_selected, "{contrast}",
                    "{contrast}_{{0.small_type}}_RPKM_folds.txt"),
                contrast=contrasts_dict[wildcards.contrast_type])]
        else:
            return rules.compute_RPKM_folds.output.fold_results
    else:
        raise NotImplementedError("Unknown fold type: %s" % fold_type)

The above function is used as input for two rules:

rule make_gene_list_lfc_boxplots:
    input:
        data = source_fold_data,
    output:
        boxplots = OPJ(output_dir, "figures", "{contrast}",
            "{contrast}_{small_type}_{fold_type}_{gene_list}_boxplots.{fig_format}")
    params:
        id_lists = set_id_lists,
    run:
        data = pd.read_table(input.data, index_col="gene")
        lfcs = pd.DataFrame(
            {list_name : data.loc[set(id_list)][wildcards.fold_type] for (
                list_name, id_list) in params.id_lists.items()})
        save_plot(output.boxplots, plot_boxplots, lfcs, wildcards.fold_type)


rule make_contrast_lfc_boxplots:
    input:
        data = source_fold_data,
    output:
        boxplots = OPJ(output_dir, "figures", "all_{contrast_type}",
            "{contrast_type}_{small_type}_{fold_type}_{gene_list}_boxplots.{fig_format}")
    params:
        id_lists = set_id_lists,
    run:
        lfcs = pd.DataFrame(
            {f"{contrast}_{list_name}" : pd.read_table(filename, index_col="gene").loc[
                set(id_list)]["mean_log2_RPKM_fold"] for (
                    contrast, filename) in zip(contrasts_dict["ip"], input.data) for (
                        list_name, id_list) in params.id_lists.items()})
        save_plot(output.boxplots, plot_boxplots, lfcs, wildcards.fold_type)

The second rule fails with 'InputFiles' object has no attribute 'data', but only in some cases. I ran the same workflow with two different configuration files, and the error occurred with only one of them, although the rule was executed in both cases and the same branch of the input function was taken.

How can this happen if the rule has:

    input:
        data = ...

?

I suppose this has to do with what my source_fold_data returns: either the explicit output of another rule, or a "manually" constructed list of file names.

Upvotes: 1

Views: 1949

Answers (1)

bli

Reputation: 8184

As @Colin suggested in the comments, the problem happens when the input function returns an empty list. This is the case here when contrasts_dict[wildcards.contrast_type] is an empty list, which means there is no point in trying to generate the output of the rule make_contrast_lfc_boxplots in the first place. I avoided the situation by modifying the input section of the rule all as follows:

Old version:

rule all:
    input:
        # [...]
        expand(
            OPJ(output_dir, "figures", "all_{contrast_type}",
                "{contrast_type}_{small_type}_{fold_type}_{gene_list}_boxplots.{fig_format}"),
            contrast_type=["ip"], small_type=IP_TYPES,
            fold_type=["mean_log2_RPKM_fold"],
            gene_list=BOXPLOT_GENE_LISTS, fig_format=FIG_FORMATS),
        # [...]

New version:

if contrasts_dict["ip"]:
    ip_fold_boxplots = expand(
        OPJ(output_dir, "figures", "all_{contrast_type}",
            "{contrast_type}_{small_type}_{fold_type}_{gene_list}_boxplots.{fig_format}"),
        contrast_type=["ip"], small_type=IP_TYPES,
        fold_type=["mean_log2_RPKM_fold"],
        gene_list=BOXPLOT_GENE_LISTS, fig_format=FIG_FORMATS)
else:
    ip_fold_boxplots = []
rule all:
    input:
        # [...]
        ip_fold_boxplots,
        # [...]
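
The same guard can be factored into a small helper so that every optional target is handled the same way. This is only a minimal sketch: `targets_if` is a hypothetical name, it is plain Python rather than part of snakemake, and the dictionary contents and file names are made-up stand-ins.

```python
def targets_if(condition, make_targets):
    """Return make_targets() as a list if condition is truthy, else [].

    make_targets is a zero-argument callable, so the expansion is only
    evaluated when the targets are actually wanted.
    """
    return list(make_targets()) if condition else []


# Stand-in data for illustration only (hypothetical values).
contrasts_dict = {"ip": [], "de": ["mut_vs_WT"]}

# No "ip" contrasts: nothing is requested from the workflow.
ip_targets = targets_if(
    contrasts_dict["ip"],
    lambda: ["figures/all_ip/ip_boxplots.pdf"])

# Some "de" contrasts: the targets are built as usual.
de_targets = targets_if(
    contrasts_dict["de"],
    lambda: ["figures/all_de/de_boxplots.pdf"])
```

In the Snakefile, the lambda would wrap the expand(...) call shown in the new version, and the result would be listed in the input of rule all.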

Some tinkering with snakemake/rules.py shows that, at some point, the data attribute exists on the input attribute of the Rule object named make_contrast_lfc_boxplots, and that it still holds the source_fold_data function. I suppose the function is evaluated later and the name is dropped when the result is an empty list, but I haven't been able to find where.
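
Without knowing where the name is lost, the run block can at least be written so that it does not crash on the missing attribute. The sketch below assumes only that the failure is an ordinary AttributeError, as in the error message above. `FakeInput` is a hypothetical stand-in so that the example runs without snakemake, and `get_named_input` is likewise an invented helper name.

```python
class FakeInput(list):
    """Hypothetical stand-in for a rule's `input` object (not snakemake's class)."""


def get_named_input(input_obj, name):
    """Return the files registered under `name`, or [] if the name is missing."""
    return getattr(input_obj, name, [])


# Mimics the failing case: the object has no `data` attribute at all.
empty_input = FakeInput()

# Mimics the working case: the name is present and points to the files.
named_input = FakeInput(["a.txt"])
named_input.data = ["a.txt"]

missing = get_named_input(empty_input, "data")
present = get_named_input(named_input, "data")
```

Inside the rule, this would amount to replacing input.data with getattr(input, "data", []) and handling the empty case explicitly.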

I suppose the empty input is not a problem while snakemake constructs the dependency graph between rules, so the problem only shows up when the rule is executed.
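
Another option is to make the input function itself complain when it has nothing to return, so that the error appears as soon as the function is evaluated rather than later inside the run block. A minimal sketch, assuming a hypothetical decorator name `require_non_empty`; the two example functions are stand-ins, not the real source_fold_data.

```python
import functools


def require_non_empty(input_func):
    """Wrap an input function so that an empty list or tuple raises a clear error."""
    @functools.wraps(input_func)
    def wrapper(wildcards):
        files = input_func(wildcards)
        # A single file name (a non-empty string) passes through untouched.
        if isinstance(files, (list, tuple)) and not files:
            raise ValueError(
                f"{input_func.__name__} returned no files for {wildcards!r}")
        return files
    return wrapper


@require_non_empty
def no_files(wildcards):
    # Stand-in for the case where contrasts_dict[...] is empty.
    return []


@require_non_empty
def some_files(wildcards):
    # Stand-in for the normal case.
    return ["a.txt"]
```

Applied to source_fold_data, this turns the obscure missing-attribute error into a message that names the input function and the wildcards involved.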

Upvotes: 2
