Reputation: 15
I wrote a simple ChIP-seq pipeline in Snakemake using a JSON-formatted config file, and the dry run worked as expected. After further reading on best practices, I switched to a YAML-formatted config file and made what I thought were the appropriate changes, but now I'm getting a "string indices must be integers" error.
The pipeline runs Trimmomatic, FastQC, Bowtie2 and MACS2, using wrappers as much as possible. I'm including only the Trimmomatic and FastQC code for simplicity, since I think the issue is in reading the config file. The config file contains samples (three CSV files), directories (to create a consistent directory structure), an outfile base name, and sequence data (genome, etc.).
config.yaml
---
samples:
  sample_names:samples.csv
  sample_files:files.csv
  sample_comparisons:comps.csv
directories:
  base_dir: /base/
  sample_dir: Samples/
  seq_dir: Raw_Sequences/
  trim_dir: Sequences/
  aln_dir: Alignments/
  peak_dir: Peak_Calling/
  logs_dir: Logs/
out_base: base
ref_seq_data:
  genome:<genome directory>
  bt2_index:<bowtie2 index directory>
...
Snakefile
import pandas as pd
shell.prefix("set -euo pipefail; ")
configfile: "config.yaml"
sample_names = config["samples"]["sample_names"]
sample_files = config["samples"]["sample_files"]
sample_comparisons = config["samples"]["sample_comparisons"]
base_dir = config["directories"]["base_dir"]
sample_dir = config["directories"]["base_dir"]+config["directories"]["sample_dir"]
seq_dir = config["directories"]["seq_dir"]
trim_dir = config["directories"]["sample_dir"]+config["directories"]["trim_dir"]
aln_dir = config["directories"]["sample_dir"]+config["directories"]["aln_dir"]
peak_dir = config["directories"]["sample_dir"]+config["directories"]["peak_dir"]
log_dir = config["directories"]["sample_dir"]+config["directories"]["logs_dir"]
genome = config["ref_seq_data"]["genome"]
out_base = config["out_base"]
samples = pd.read_csv(sample_names, index_col="sample")
files = pd.read_csv(sample_files, index_col = "sample")
comparisons = pd.read_csv(sample_comparisons)
rule all:
    input:
        expand(log_dir+"{sample}_{read}_fastqc.html", sample = samples, read = [1,2])
# Load Rules
include: "Snakemake_rules/NGS_QC.smk"
The error message I receive is:
TypeError in line 7 of Snakefile: string indices must be integers
With the JSON-formatted config file, I didn't have the entries broken up into groups (each key was standalone at the top level), and looking them up with config[] correctly assigned the proper values.
Most of the discussion I've seen about this error involves iteration, so I'm not sure why it occurs here with the YAML-formatted file.
Upvotes: 1
Views: 258
Reputation: 6584
The problem is in your config.yaml file.
samples:
  sample_names:samples.csv
  sample_files:files.csv
  sample_comparisons:comps.csv
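Keys and values must be separated by a colon followed by a space. Without the space, the YAML parser treats the samples section not as a dictionary but as a single plain string: the three lines are folded into one scalar, so config["samples"] == "sample_names:samples.csv sample_files:files.csv sample_comparisons:comps.csv". Indexing that string with a key such as "sample_names" is what raises the "string indices must be integers" error. The ref_seq_data section in your file has the same problem.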
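If you want to see what the parser actually produces, here is a minimal sketch using PyYAML directly (I'm assuming it is installed, which it should be wherever Snakemake is). The two strings are shortened stand-ins for your config.yaml, not the real file, and the variable names are only for illustration.

```python
import yaml  # PyYAML; assumed to be available alongside Snakemake

# Shortened stand-in for the config in the question: no space after the colons.
broken = """\
samples:
  sample_names:samples.csv
  sample_files:files.csv
"""

# Same section with a space after each colon.
fixed = """\
samples:
  sample_names: samples.csv
  sample_files: files.csv
"""

broken_cfg = yaml.safe_load(broken)
fixed_cfg = yaml.safe_load(fixed)

# Without the space the nested lines are read as one plain scalar (a str).
print(type(broken_cfg["samples"]).__name__, repr(broken_cfg["samples"]))

# With the space they form a nested mapping (a dict).
print(type(fixed_cfg["samples"]).__name__, fixed_cfg["samples"])

# Indexing the string with a key reproduces the TypeError from the question.
try:
    broken_cfg["samples"]["sample_names"]
except TypeError as err:
    print("TypeError:", err)
```

The exact wording of the TypeError depends on the Python version, but it always starts with "string indices must be integers".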
The correct config should be:
samples:
  sample_names: samples.csv
  sample_files: files.csv
  sample_comparisons: comps.csv
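To catch this kind of mistake with a clearer message, you could check each section near the top of the Snakefile before indexing into it. This is only a sketch: require_mapping is a hypothetical helper, not part of Snakemake, and the dict below stands in for the config object that Snakemake fills in from configfile.

```python
def require_mapping(config, section):
    """Return config[section], failing early if it is not a dict.

    Hypothetical helper for illustration; not a Snakemake function.
    """
    value = config.get(section)
    if not isinstance(value, dict):
        raise ValueError(
            f"config section '{section}' should be a mapping, got "
            f"{type(value).__name__}: {value!r}. "
            "Check for a missing space after ':' in config.yaml."
        )
    return value


# Hand-written stand-in for Snakemake's `config`: one good section, one broken.
config = {
    "samples": {"sample_names": "samples.csv"},
    "ref_seq_data": "genome:<genome directory>",
}

samples_cfg = require_mapping(config, "samples")
print(samples_cfg["sample_names"])

try:
    require_mapping(config, "ref_seq_data")
except ValueError as err:
    print(err)
```

In the real Snakefile you would call it as require_mapping(config, "samples") right after the configfile line, so a malformed section fails with a message that points at the config file rather than at an indexing expression.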
Upvotes: 1