jmai

Reputation: 15

Snakemake: How do I get a shell command running with different arguments (integer) in a rule?

I'm trying to find the best hyperparameters for my boosted decision tree (BDT) training. Here's the code for just two values:

user = '/home/.../BDT/'

nestimators = [1, 2]

rule all:
        input: user + 'AUC_score.pdf'

rule testing:
        output: user + 'AUC_score.csv'
        shell: 'python bdt.py --nestimators {}'.format(nestimators[i] for i in range(2))

rule plotting:
        input: user + 'AUC_score.csv'
        output: user + 'AUC_score.pdf'
        shell: 'python opti.py'

The plan is as follows: I want to parallelize the training of my BDT over a range of hyperparameters (to begin with, just nestimators). So I try to use the shell command to train the BDT. bdt.py takes the hyperparameter as an argument, trains, and saves the hyperparameters and the training score in a csv file. In the csv file I can then see which hyperparameters give the best scores. Yay!

Sadly it doesn't work like that. I tried to use an input function, but since I want to pass an integer, that does not work. I then tried it the way you can see above, but now the command that gets run is 'python bdt.py --nestimators <generator object at 0x7f5981a9d150>'. I understand why this doesn't work either, but I don't know where to go from here.

Upvotes: 1

Views: 497

Answers (2)

wheat

Reputation: 116

The error arises because {} is replaced by a generator object, that is, it is not replaced first by 1 and then by 2 but, so to speak, by an iterator over nestimators.
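To make this concrete, here is a minimal plain-Python sketch (outside Snakemake) of what the format call in the question evaluates to, next to formatting once per value. The names broken and commands are only for illustration.

```python
nestimators = [1, 2]

# A generator expression passed to str.format() is inserted as the
# generator object itself; it is never iterated.
broken = 'python bdt.py --nestimators {}'.format(nestimators[i] for i in range(2))
print(broken)  # contains '<generator object ...>' instead of 1 or 2

# Formatting once per value gives one command string per hyperparameter.
commands = ['python bdt.py --nestimators {}'.format(n) for n in nestimators]
print(commands)
```

Note that a single shell: string can only hold one command, so the second form still needs one job per value to be useful in a workflow.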

Even if you correct the Python expression in the rule testing, there may be a more fundamental problem, if I understand your aim correctly. Snakemake workflows are defined in terms of rules that describe how to create output files from input files. Because the rule testing declares a single fixed output file, it will run only once, whereas you probably want it to run separately for each hyperparameter.

The solution is to include the hyperparameter in the output filename. Something like this:

user = '/home/.../BDT/'

nestimators = [1, 2]

rule all:
        input: user + 'AUC_score.pdf'

rule testing:
        output: user + 'AUC_score_{hyper}.csv'
        shell: 'python bdt.py --nestimators {wildcards.hyper}'

rule plotting:
        input: expand(user + 'AUC_score_{hyper}.csv', hyper=nestimators)
        output: user + 'AUC_score.pdf'
        shell: 'python opti.py'
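As a rough illustration of what the expand() call in rule plotting asks for, the following plain-Python sketch builds the same list of filenames by hand. It is only an equivalent for this simple one-wildcard case, not Snakemake's own implementation, and plot_inputs is a name chosen for illustration.

```python
user = '/home/.../BDT/'  # path is elided in the question and kept as-is
nestimators = [1, 2]

# Hand-written equivalent of:
#   expand(user + 'AUC_score_{hyper}.csv', hyper=nestimators)
plot_inputs = [user + 'AUC_score_{hyper}.csv'.format(hyper=h) for h in nestimators]
print(plot_inputs)
# ['/home/.../BDT/AUC_score_1.csv', '/home/.../BDT/AUC_score_2.csv']
```

Each of these files matches the output pattern of rule testing, so Snakemake schedules one independent job per value of hyper, and those jobs can run in parallel. This assumes bdt.py writes its result to the matching AUC_score_{hyper}.csv file, which the question does not show.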

Finally, instead of using shell: to call a Python script, you can use script: directly, as explained in the documentation: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#external-scripts

Upvotes: 1

Dmitry Kuzminov

Reputation: 6584

The problem in your code is that the expression nestimators[i] for i in range(2) is not a list (as you may think). It is a generator expression, and it produces no values until something explicitly iterates over it. For example, this code:

'python bdt.py --nestimators {}'.format(list(nestimators[i] for i in range(2)))

produces the result 'python bdt.py --nestimators [1, 2]'

Actually you don't need to have a generator at all, as this code produces exactly the same output:

'python bdt.py --nestimators {}'.format(nestimators)

This format is probably not what your script expects. For example, if you want a command line like python bdt.py --nestimators 1,2, you may use this expression:

'python bdt.py --nestimators {}'.format(",".join(map(str, nestimators)))

The last expression can be shortened if you can use f-strings (Python 3.6 or later):

f'python bdt.py --nestimators {",".join(map(str, nestimators))}'
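On the receiving side, the script would have to split the comma-separated value back into integers. The question does not show how bdt.py parses its arguments, so the following is only a sketch assuming it uses argparse; comma_separated_ints is a hypothetical helper.

```python
import argparse


def comma_separated_ints(text):
    """Turn a string such as '1,2' into a list of integers."""
    return [int(part) for part in text.split(',')]


parser = argparse.ArgumentParser()
# Assumed option name, taken from the command line used in the question.
parser.add_argument('--nestimators', type=comma_separated_ints)

# Simulate: python bdt.py --nestimators 1,2
args = parser.parse_args(['--nestimators', '1,2'])
print(args.nestimators)  # [1, 2]
```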

Upvotes: 0
