Reputation: 69
I'm trying to use a 2 pass STAR mapping strategy (also explained here https://informatics.fas.harvard.edu/rsem-example-on-odyssey.html), but I'm getting an error.
I've read through this page [https://github.com/alexdobin/STAR/issues/181] and I have a similar issue, but the discussed solutions don't seem to help. Perhaps this is more a snakemake issue rather than a STAR issue, therefore I'm asking it here.
I use STAR version 2.7.10 on an HPC cluster. I'm running a snakemake file in which I map human and mouse samples with STAR 2 pass mapping, and get the following error:
['PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872', 'PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621'] ['SRR7942395_GSM3406786_sAML_Control_1', 'SRR7942395_GSM3406786_sAML_Control_1', 'SRR5195524_GSM2465521_KrasT_45649_NoDox', 'SRR5195524_GSM2465521_KrasT_45649_NoDox'] ['Homo_sapiens', 'Homo_sapiens', 'Mus_musculus', 'Mus_musculus'] [1, 2, 1, 2] ['GRCh38.106', 'GRCh38.106', 'GRCm39.107', 'GRCm39.107'] ['GRCh38', 'GRCh38', 'GRCm39', 'GRCm39']
The flag 'directory' used in rule all is only valid for outputs, not inputs.
The flag 'directory' used in rule all is only valid for outputs, not inputs.
The flag 'directory' used in rule all is only valid for outputs, not inputs.
The flag 'directory' used in rule all is only valid for outputs, not inputs.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 56
Rules claiming more threads will be scaled down.
Job counts:
count jobs
2 RSEM_calculate_expression
2 RSEM_prepare_reference
1 all
2 filter
2 genebody_coverage
2 infer_strandedness
4 rawFastqc
2 run_multiqc
2 samtools_index_bam
1 starIndex
2 star_1pass_alignment
2 star_2pass_alignment
2 ucsc_gtftobed_Homo_sapiens
26
[Tue Aug 16 13:09:01 2022]
rule RSEM_prepare_reference:
input: /DATA//resources/Homo_sapiens.GRCh38.dna.primary_assembly.fa, /DATA//resources/Homo_sapiens.GRCh38.106.gtf
output: /DATA//resources/RSEM_ref/Homo_sapiens_GRCh38.106/Homo_sapiens_GRCh38.106_GRCh38_rsem_ref.seq
jobid: 19
wildcards: species=Homo_sapiens, gtf_version=GRCh38.106, fa_version=GRCh38
threads: 12
[Tue Aug 16 13:09:01 2022]
rule RSEM_prepare_reference:
input: /DATA//resources/Mus_musculus.GRCm39.dna.primary_assembly.fa, /DATA//resources/Mus_musculus.GRCm39.107.gtf
output: /DATA//resources/RSEM_ref/Mus_musculus_GRCm39.107/Mus_musculus_GRCm39.107_GRCm39_rsem_ref.seq
jobid: 20
wildcards: species=Mus_musculus, gtf_version=GRCm39.107, fa_version=GRCm39
threads: 12
[Tue Aug 16 13:09:01 2022]
rule star_1pass_alignment:
input: /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz, /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz
output: /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_Aligned.sortedByCoord.out.bam, /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_Log.final.out, /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_SJ.out.tab
jobid: 10
wildcards: dataset=PRJNA362883_GSE93946_SRP097621, sample=SRR5195524_GSM2465521_KrasT_45649_NoDox, species=Mus_musculus
threads: 12
[Tue Aug 16 13:09:01 2022]
rule starIndex:
input: /DATA//resources/Mus_musculus.GRCm39.dna.primary_assembly.fa, /DATA//resources/Mus_musculus.GRCm39.107.gtf
output: /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107
jobid: 2
wildcards: species=Mus_musculus, fa_version=GRCm39, gtf_version=GRCm39.107
threads: 20
Activating conda environment: /DATA//workflow/.snakemake/conda/d88d0970
Activating conda environment: /DATA//workflow/.snakemake/conda/d88d0970
Activating conda environment: /DATA//workflow/.snakemake/conda/b20308a2
Activating conda environment: /DATA//workflow/.snakemake/conda/b20308a2
rsem-extract-reference-transcripts {config[project_path]}+resources/RSEM_ref/Mus_musculus_GRCm39.107/Mus_musculus_GRCm39.107_GRCm39 0 /DATA//resources/Mus_musculus.GRCm39.107.gtf None 0 /DATA//resources/Mus_musculus.GRCm39.dna.primary_assembly.fa
STAR --runThreadN 20 --runMode genomeGenerate --genomeDir /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107 --genomeFastaFiles /DATA//resources/Mus_musculus.GRCm39.dna.primary_assembly.fa --sjdbGTFfile /DATA//resources/Mus_musculus.GRCm39.107.gtf
STAR version: 2.7.10a compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Aug 16 13:09:13 ..... started STAR run
Aug 16 13:09:13 ... starting to generate Genome files
STAR --runMode alignReads --genomeDir /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107 --genomeLoad LoadAndKeep --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 10000000000 --limitGenomeGenerateRAM 20000000000 --readFilesIn /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz --runThreadN 12 --readFilesCommand gunzip -c --outFileNamePrefix /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_
STAR version: 2.7.10a compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Aug 16 13:09:13 ..... started STAR run
Aug 16 13:09:13 ..... loading genome
EXITING because of FATAL ERROR: could not open genome file /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107//genomeParameters.txt
SOLUTION: check that the path to genome files, specified in --genomeDir is correct and the files are present, and have user read permsissions
Aug 16 13:09:13 ...... FATAL ERROR, exiting
[Tue Aug 16 13:09:13 2022]
Error in rule star_1pass_alignment:
jobid: 10
output: /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_Aligned.sortedByCoord.out.bam, /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_Log.final.out, /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_SJ.out.tab
conda-env: /DATA//workflow/.snakemake/conda/d88d0970
shell:
STAR --runMode alignReads --genomeDir /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107 --genomeLoad LoadAndKeep --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 10000000000 --limitGenomeGenerateRAM 20000000000 --readFilesIn /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz --runThreadN 12 --readFilesCommand gunzip -c --outFileNamePrefix /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
I've attached my log file.
import pandas as pd
#import glob
import os
import fnmatch
import re
# --- Importing Configuration Files --- #
configfile: "/DATA//config/config.yaml"
#DATASET,SAMPLE,SPECIES,FRR, =glob_wildcards(config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_{frr}.fastq.gz")
#print(DATASET,SAMPLE,SPECIES,FRR)
#SPECIES,GTF_VERSION,=glob_wildcards(config["project_path"]+"resources/{gtf_species}.{gtf_version}.gtf")
#SPECIES,FA_VERSION,=glob_wildcards(config["project_path"]+"resources/{fa_species}-{fa_version}.dna.primary_assembly.fa")
#print(SPECIES,GTF_VERSION)
table_cols = ['dataset','sample','species','frr','gtf_version','fa_version']
table_samples = pd.read_table('/DATA//config/samples.tsv', header=0, sep='\t', names=table_cols)
DATASET = table_samples.dataset.values.tolist()
SAMPLE = table_samples['sample'].values.tolist()
SPECIES = table_samples.species.values.tolist()
FRR = table_samples.frr.values.tolist()
GTF_VERSION = table_samples.gtf_version.values.tolist()
FA_VERSION = table_samples.fa_version.values.tolist()
print(DATASET,SAMPLE,SPECIES,FRR,GTF_VERSION,FA_VERSION)
rule all:
input:
directory(expand(config["project_path"]+"resources/starIndex_{species}_{fa_version}_{gtf_version}",zip, species=SPECIES, fa_version=FA_VERSION, gtf_version=GTF_VERSION)),
expand(config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.html", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, frr=FRR),
expand(config["project_path"]+"results/{dataset}/rawQC/multiqc_report.html", dataset=DATASET),
expand(config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_Aligned.sortedByCoord.out.bam", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES),
expand(config["project_path"]+"results/{dataset}/star_aligned_2pass/{sample}_{species}_Aligned.sortedByCoord.out.bam", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES),
expand(config["project_path"]+"results/{dataset}/star_aligned_2pass/{sample}_{species}_Aligned.sortedByCoord.out.bam.bai", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES),
expand(config["project_path"]+"resources/{species}.{gtf_version}.bed", zip, species=SPECIES, gtf_version=GTF_VERSION),
expand(config["project_path"]+"results/{dataset}/star_aligned_2pass/{sample}_{species}_Aligned.toTranscriptome.out.bam", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES),
expand(config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_{gtf_version}.geneBodyCoverage.txt", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, gtf_version=GTF_VERSION),
expand(config["project_path"]+"resources/RSEM_ref/{species}_{gtf_version}/{species}_{gtf_version}_{fa_version}_rsem_ref.seq", zip, species=SPECIES, gtf_version=GTF_VERSION, fa_version=FA_VERSION),
expand(config["project_path"]+"results/{dataset}/RSEM/{sample}_{species}_{gtf_version}_{fa_version}.genes.results", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, gtf_version=GTF_VERSION, fa_version=FA_VERSION)
wildcard_constraints:
dataset="|".join([re.escape(x) for x in DATASET]),
sample="|".join([re.escape(x) for x in SAMPLE]),
species="|".join([re.escape(x) for x in SPECIES]),
gtf_version="|".join([re.escape(x) for x in GTF_VERSION]),
fa_version="|".join([re.escape(x) for x in FA_VERSION])
## rule starIndex ## Create star index if it does not exist yet
rule starIndex:
input:
fasta=config["project_path"]+"resources/{species}.{fa_version}.dna.primary_assembly.fa",
gtf=config["project_path"]+"resources/{species}.{gtf_version}.gtf"
output:
directory(config["project_path"]+"resources/starIndex_{species}_{fa_version}_{gtf_version}")
threads:
20
conda:
"envs/DTPedia_bulkRNAseq.yaml"
shell:
"""
STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir {output} --genomeFastaFiles {input.fasta} --sjdbGTFfile {input.gtf}
"""
# function determine_species_fasta # function for determining fasta input of correct species to rule starIndex
def determine_species(wildcards,input):
if fnmatch.fnmatch(input.read1, '*Homo_sapiens*'):
return directory(expand(rules.starIndex.output, species = "Homo_sapiens", fa_version="GRCh38", gtf_version="GRCh38.106"))
elif fnmatch.fnmatch(input.read1, '*Mus_musculus*'):
return directory(expand(rules.starIndex.output, species="Mus_musculus", fa_version="GRCm39", gtf_version="GRCm39.107"))
rule star_1pass_alignment:
input:
read1=config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_1.fastq.gz",
read2=config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_2.fastq.gz"
output:
bam=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_Aligned.sortedByCoord.out.bam",
log=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_Log.final.out",
sj_1pass=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_SJ.out.tab"
params:
index=determine_species,
prefix=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_"
threads:
12
conda:
"envs/DTPedia_bulkRNAseq.yaml"
shell:
"""
STAR --runMode alignReads --genomeDir {params.index} --genomeLoad LoadAndKeep --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 10000000000 --limitGenomeGenerateRAM 20000000000 --readFilesIn {input.read1} {input.read2} --runThreadN {threads} --readFilesCommand gunzip -c --outFileNamePrefix {params.prefix}
"""
Output of $ulimit -a is
$ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2063155
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 2063155
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
It seems like the STAR 1st pass rule is being run before the STAR index rule and therefore giving an error that it cannot find the genome index files in
/DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107/
. It's not clear how to solve this, am I doing something wrong in snakemake, STAR or something else that I'm overlooking? Help would be very much appreciated, thanks!
EDIT 1: After modifying the code as @dariober suggested, the initial error disappeared. Now I'm getting the following error:
WorkflowError in line ...: Function did not return str or list of str
I've modified the function def determine_species
according to the suggestions. It's getting closer but not fully there yet:
def determine_species(wildcards):
read1 = config["project_path"]+"resources/raw_datasets/{wildcards.dataset}/{wildcards.sample}_{wildcards.species}_RNA-Seq_1.fastq.gz"
Hs_index = "/DATA/m.venkatesan/DTPedia/resources/starIndex_Homo_sapiens_GRCh38_GRCh38.106"
Ms_index = "/DATA/m.venkatesan/DTPedia/resources/starIndex_Mus_musculus_GRCm39_GRCm39.107"
if fnmatch.fnmatch(read1, '*Homo_sapiens*'):
return Hs_index
elif fnmatch.fnmatch(read1, '*Mus_musculus*'):
return Ms_index
Upvotes: 1
Views: 930
Reputation: 9062
The problem is that the input to your rule star_1pass_alignment
is just the two fastq files:
rule star_1pass_alignment:
input:
read1=config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_1.fastq.gz",
read2=config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_2.fastq.gz"
...
However, STAR requires also the index of the reference genome, which you list as a parameter. This means that snakemake doesn't "see" that rule starIndex
should run before star_1pass_alignment
. From snakemake's point of view, rule starIndex
is unnecessary.
Therefore, the solution is to make the output of starIndex
(resources/starIndex_{species}_{fa_version}_{gtf_version}
) an input of star_1pass_alignment
. Something like this, not tested:
rule star_1pass_alignment:
input:
read1=...
read2=...
index=determine_species,
prefix=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_",
Rule priorities should not matter here.
Regarding the error about TypeError: determine_species() missing 1 required positional argument: 'input'
(this should be a separate question): You defined determine_species
to take two parameters: wildcards
and input
. However, the code index=determine_species,
calls this function passing to it only the wildcards
object, hence the error (I find this a bit cryptic but that's the way it is...). The solution is to "build" the input parameter input
inside the function itself using the wildcards
object. That is something like:
def determine_species(wildcards):
read1 = config["project_path"]+"resources/raw_datasets/{wildcards.dataset}/{wildcards.sample}_{wildcards.species}_RNA-Seq_1.fastq.gz",
if fnmatch.fnmatch(read1, '*Homo_sapiens*'):
return expand(rules.starIndex.output, species = "Homo_sapiens", fa_version="GRCh38", gtf_version="GRCh38.106")
elif fnmatch.fnmatch(read1, '*Mus_musculus*'):
return expand(rules.starIndex.output, species="Mus_musculus", fa_version="GRCm39", gtf_version="GRCm39.107")
(NB: I haven't checked whether what this function does!)
If you need to use a function that takes additional parameters you can use this construct:
def my_func(wildcards, arg1, arg2):
# Do something with wildcards, arg1, arg2
input:
foo = lambda wildcards: my_func(wildcards, arg1, arg2),
Upvotes: 2