D. Joe

Reputation: 15

String indices error in snakefile when switching from JSON to yaml config file

I wrote a simple ChIP-seq pipeline in Snakemake using a JSON-formatted config file, and the dry run worked as expected. After further reading on best practices, I switched to a YAML-formatted config file and made what I thought were the appropriate changes, but now I'm getting a "string indices must be integers" error.

The pipeline runs Trimmomatic, FastQC, Bowtie2 and MACS2, using wrappers as much as possible. For simplicity I'm including only the Trimmomatic and FastQC code, since I think the problem lies in reading the config file. The config file contains samples (three CSV files), directories (to create a consistent directory structure), an output base name, and sequence data (genome, etc.).

config.yaml

---

samples:
  sample_names:samples.csv
  sample_files:files.csv
  sample_comparisons:comps.csv
directories:
  base_dir: /base/
  sample_dir: Samples/
  seq_dir: Raw_Sequences/
  trim_dir: Sequences/
  aln_dir: Alignments/
  peak_dir: Peak_Calling/
  logs_dir: Logs/
out_base: base
ref_seq_data:
  genome:<genome directory>
  bt2_index:<bowtie2 index directory>

...

Snakefile

import pandas as pd

shell.prefix("set -euo pipefail; ")

configfile: "config.yaml"

sample_names = config["samples"]["sample_names"]
sample_files = config["samples"]["sample_files"]
sample_comparisons = config["samples"]["sample_comparisons"]
base_dir = config["directories"]["base_dir"]
sample_dir = config["directories"]["base_dir"]+config["directories"]["sample_dir"]
seq_dir = config["directories"]["seq_dir"]
trim_dir = config["directories"]["sample_dir"]+config["directories"]["trim_dir"]
aln_dir = config["directories"]["sample_dir"]+config["directories"]["aln_dir"]
peak_dir = config["directories"]["sample_dir"]+config["directories"]["peak_dir"]
log_dir = config["directories"]["sample_dir"]+config["directories"]["logs_dir"]
genome = config["ref_seq_data"]["genome"]
out_base = config["out_base"]

samples = pd.read_csv(sample_names, index_col="sample")
files = pd.read_csv(sample_files, index_col = "sample")
comparisons = pd.read_csv(sample_comparisons)

rule all:
  input:
    expand(log_dir+"{sample}_{read}_fastqc.html", sample = samples, read = [1,2])

# Load Rules

include: "Snakemake_rules/NGS_QC.smk"

The error message I receive is:

TypeError in line 7 of Snakefile: string indices must be integers

With the JSON-formatted config file, the settings were not grouped into sections (each key was a standalone top-level entry), and looking them up with config[] returned the expected values.

Most of the discussion I've seen about this error involves iteration, so I'm not sure why it occurs here with the YAML-formatted file.

Upvotes: 1

Views: 258

Answers (1)

Dmitry Kuzminov

Reputation: 6584

The problem is in your config.yaml file.

samples:
  sample_names:samples.csv
  sample_files:files.csv
  sample_comparisons:comps.csv

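Each key must be separated from its value by a colon followed by a space. Without the space, the YAML parser does not treat the samples section as a dictionary. As far as I can tell, it folds the three lines into a single string instead: config["samples"] == "sample_names:samples.csv sample_files:files.csv sample_comparisons:comps.csv". Indexing that string with "sample_names" is what raises "string indices must be integers". The ref_seq_data section in your file has the same missing spaces and needs the same fix.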
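
To see where the message comes from, here is a minimal sketch in plain Python that reproduces the failing lookup. It does not parse any YAML. The string assigned to config["samples"] is an assumption about what the parser produces for the malformed section (three "key:value" lines folded into one plain scalar), and lookup_error is a hypothetical helper used only for illustration.

```python
# Assumption: this is what the malformed "samples" section loads as.
# The three "key:value" lines contain no ": " separator, so they are
# read as one multi-line string rather than as a mapping.
config = {
    "samples": "sample_names:samples.csv sample_files:files.csv sample_comparisons:comps.csv"
}


def lookup_error(cfg):
    """Return the TypeError message raised by the Snakefile's first lookup, or None."""
    try:
        cfg["samples"]["sample_names"]  # same expression as in the Snakefile
    except TypeError as err:
        return str(err)
    return None


message = lookup_error(config)
print(type(config["samples"]).__name__)  # the section is a str, not a dict
print(message)  # begins with "string indices must be integers"
```

With a proper mapping such as {"samples": {"sample_names": "samples.csv"}}, the same lookup succeeds and lookup_error returns None.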

The correct config should be:

samples:
  sample_names: samples.csv
  sample_files: files.csv
  sample_comparisons: comps.csv
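
If you want this kind of mistake to fail with a clearer message, one option is a small guard near the top of the Snakefile. The sketch below is only a suggestion: require_mapping is a hypothetical helper, not part of Snakemake, and the two dictionaries stand in for the config object that Snakemake builds from config.yaml.

```python
def require_mapping(config, section, keys):
    """Return config[section] if it is a dict containing all keys, else raise ValueError."""
    value = config.get(section)
    if not isinstance(value, dict):
        raise ValueError(
            f"config section '{section}' was read as {type(value).__name__}, "
            "expected a mapping; check that every entry is 'key: value' "
            "with a space after the colon"
        )
    missing = [key for key in keys if key not in value]
    if missing:
        raise ValueError(f"config section '{section}' is missing keys: {missing}")
    return value


# Stand-ins for the parsed config (assumed shapes, for illustration only).
good_config = {
    "samples": {
        "sample_names": "samples.csv",
        "sample_files": "files.csv",
        "sample_comparisons": "comps.csv",
    }
}
bad_config = {"samples": "sample_names:samples.csv sample_files:files.csv"}

wanted = ["sample_names", "sample_files", "sample_comparisons"]
samples_cfg = require_mapping(good_config, "samples", wanted)
print(samples_cfg["sample_names"])

try:
    require_mapping(bad_config, "samples", wanted)
except ValueError as err:
    print(err)
```

In the Snakefile itself, the call would be require_mapping(config, "samples", [...]) placed after the configfile line, so a malformed section is reported before any lookups run.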

Upvotes: 1
