STAR error in snakemake pipeline: "EXITING because of FATAL ERROR: could not open genome file"

Question

I'm trying to use a 2 pass STAR mapping strategy (also explained here https://informatics.fas.harvard.edu/rsem-example-on-odyssey.html), but I'm getting an error.

I've read through this page [https://github.com/alexdobin/STAR/issues/181] and I have a similar issue, but the discussed solutions don't seem to help. Perhaps this is more a snakemake issue rather than a STAR issue, therefore I'm asking it here.

I use STAR version 2.7.10 on an HPC cluster. I'm running a snakemake file in which I map human and mouse samples with STAR 2 pass mapping, and get the following error:

['PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872', 'PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621'] ['SRR7942395_GSM3406786_sAML_Control_1', 'SRR7942395_GSM3406786_sAML_Control_1', 'SRR5195524_GSM2465521_KrasT_45649_NoDox', 'SRR5195524_GSM2465521_KrasT_45649_NoDox'] ['Homo_sapiens', 'Homo_sapiens', 'Mus_musculus', 'Mus_musculus'] [1, 2, 1, 2] ['GRCh38.106', 'GRCh38.106', 'GRCm39.107', 'GRCm39.107'] ['GRCh38', 'GRCh38', 'GRCm39', 'GRCm39']
The flag 'directory' used in rule all is only valid for outputs, not inputs.
The flag 'directory' used in rule all is only valid for outputs, not inputs.
The flag 'directory' used in rule all is only valid for outputs, not inputs.
The flag 'directory' used in rule all is only valid for outputs, not inputs.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 56
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    2   RSEM_calculate_expression
    2   RSEM_prepare_reference
    1   all
    2   filter
    2   genebody_coverage
    2   infer_strandedness
    4   rawFastqc
    2   run_multiqc
    2   samtools_index_bam
    1   starIndex
    2   star_1pass_alignment
    2   star_2pass_alignment
    2   ucsc_gtftobed_Homo_sapiens
    26
 
[Tue Aug 16 13:09:01 2022]
rule RSEM_prepare_reference:
    input: /DATA//resources/Homo_sapiens.GRCh38.dna.primary_assembly.fa, /DATA//resources/Homo_sapiens.GRCh38.106.gtf
    output: /DATA//resources/RSEM_ref/Homo_sapiens_GRCh38.106/Homo_sapiens_GRCh38.106_GRCh38_rsem_ref.seq
    jobid: 19
    wildcards: species=Homo_sapiens, gtf_version=GRCh38.106, fa_version=GRCh38
    threads: 12
 
[Tue Aug 16 13:09:01 2022]
rule RSEM_prepare_reference:
    input: /DATA//resources/Mus_musculus.GRCm39.dna.primary_assembly.fa, /DATA//resources/Mus_musculus.GRCm39.107.gtf
    output: /DATA//resources/RSEM_ref/Mus_musculus_GRCm39.107/Mus_musculus_GRCm39.107_GRCm39_rsem_ref.seq
    jobid: 20
    wildcards: species=Mus_musculus, gtf_version=GRCm39.107, fa_version=GRCm39
    threads: 12
 
[Tue Aug 16 13:09:01 2022]
rule star_1pass_alignment:
    input: /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz, /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz
    output: /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_Aligned.sortedByCoord.out.bam, /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_Log.final.out, /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_SJ.out.tab
    jobid: 10
    wildcards: dataset=PRJNA362883_GSE93946_SRP097621, sample=SRR5195524_GSM2465521_KrasT_45649_NoDox, species=Mus_musculus
    threads: 12
 
 
[Tue Aug 16 13:09:01 2022]
rule starIndex:
    input: /DATA//resources/Mus_musculus.GRCm39.dna.primary_assembly.fa, /DATA//resources/Mus_musculus.GRCm39.107.gtf
    output: /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107
    jobid: 2
    wildcards: species=Mus_musculus, fa_version=GRCm39, gtf_version=GRCm39.107
    threads: 20
 
Activating conda environment: /DATA//workflow/.snakemake/conda/d88d0970
Activating conda environment: /DATA//workflow/.snakemake/conda/d88d0970
Activating conda environment: /DATA//workflow/.snakemake/conda/b20308a2
Activating conda environment: /DATA//workflow/.snakemake/conda/b20308a2
rsem-extract-reference-transcripts {config[project_path]}+resources/RSEM_ref/Mus_musculus_GRCm39.107/Mus_musculus_GRCm39.107_GRCm39 0 /DATA//resources/Mus_musculus.GRCm39.107.gtf None 0 /DATA//resources/Mus_musculus.GRCm39.dna.primary_assembly.fa
    STAR --runThreadN 20 --runMode genomeGenerate --genomeDir /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107 --genomeFastaFiles /DATA//resources/Mus_musculus.GRCm39.dna.primary_assembly.fa --sjdbGTFfile /DATA//resources/Mus_musculus.GRCm39.107.gtf
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Aug 16 13:09:13 ..... started STAR run
Aug 16 13:09:13 ... starting to generate Genome files
    STAR --runMode alignReads --genomeDir /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107 --genomeLoad LoadAndKeep --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 10000000000 --limitGenomeGenerateRAM 20000000000 --readFilesIn /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz --runThreadN 12 --readFilesCommand gunzip -c --outFileNamePrefix /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Aug 16 13:09:13 ..... started STAR run
Aug 16 13:09:13 ..... loading genome
 
EXITING because of FATAL ERROR: could not open genome file /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107//genomeParameters.txt
SOLUTION: check that the path to genome files, specified in --genomeDir is correct and the files are present, and have user read permsissions
 
Aug 16 13:09:13 ...... FATAL ERROR, exiting
[Tue Aug 16 13:09:13 2022]
Error in rule star_1pass_alignment:
    jobid: 10
    output: /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_Aligned.sortedByCoord.out.bam, /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_Log.final.out, /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_SJ.out.tab
    conda-env: /DATA//workflow/.snakemake/conda/d88d0970
    shell:
        
        STAR --runMode alignReads --genomeDir /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107 --genomeLoad LoadAndKeep --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 10000000000 --limitGenomeGenerateRAM 20000000000 --readFilesIn /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz --runThreadN 12 --readFilesCommand gunzip -c --outFileNamePrefix /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_
       
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

I've attached my log file.

log.txt

import pandas as pd
#import glob
import os
import fnmatch
import re
 
# --- Importing Configuration Files --- #
configfile: "/DATA//config/config.yaml"
 
#DATASET,SAMPLE,SPECIES,FRR, =glob_wildcards(config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_{frr}.fastq.gz")
#print(DATASET,SAMPLE,SPECIES,FRR)
#SPECIES,GTF_VERSION,=glob_wildcards(config["project_path"]+"resources/{gtf_species}.{gtf_version}.gtf")
#SPECIES,FA_VERSION,=glob_wildcards(config["project_path"]+"resources/{fa_species}-{fa_version}.dna.primary_assembly.fa")
#print(SPECIES,GTF_VERSION)
 
table_cols = ['dataset','sample','species','frr','gtf_version','fa_version']
table_samples = pd.read_table('/DATA//config/samples.tsv', header=0, sep='	', names=table_cols)
DATASET = table_samples.dataset.values.tolist()
SAMPLE = table_samples['sample'].values.tolist()
SPECIES = table_samples.species.values.tolist()
FRR = table_samples.frr.values.tolist()
GTF_VERSION = table_samples.gtf_version.values.tolist()
FA_VERSION = table_samples.fa_version.values.tolist()
 
print(DATASET,SAMPLE,SPECIES,FRR,GTF_VERSION,FA_VERSION)
 
 
rule all:
        input:
                directory(expand(config["project_path"]+"resources/starIndex_{species}_{fa_version}_{gtf_version}",zip, species=SPECIES, fa_version=FA_VERSION, gtf_version=GTF_VERSION)),
                expand(config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.html", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, frr=FRR),
                expand(config["project_path"]+"results/{dataset}/rawQC/multiqc_report.html", dataset=DATASET),
                expand(config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_Aligned.sortedByCoord.out.bam", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES),
                expand(config["project_path"]+"results/{dataset}/star_aligned_2pass/{sample}_{species}_Aligned.sortedByCoord.out.bam", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES),
                expand(config["project_path"]+"results/{dataset}/star_aligned_2pass/{sample}_{species}_Aligned.sortedByCoord.out.bam.bai", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES),
                expand(config["project_path"]+"resources/{species}.{gtf_version}.bed", zip, species=SPECIES, gtf_version=GTF_VERSION),
                expand(config["project_path"]+"results/{dataset}/star_aligned_2pass/{sample}_{species}_Aligned.toTranscriptome.out.bam", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES),
                expand(config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_{gtf_version}.geneBodyCoverage.txt", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, gtf_version=GTF_VERSION),
                expand(config["project_path"]+"resources/RSEM_ref/{species}_{gtf_version}/{species}_{gtf_version}_{fa_version}_rsem_ref.seq", zip, species=SPECIES, gtf_version=GTF_VERSION, fa_version=FA_VERSION),
                expand(config["project_path"]+"results/{dataset}/RSEM/{sample}_{species}_{gtf_version}_{fa_version}.genes.results", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, gtf_version=GTF_VERSION, fa_version=FA_VERSION)
 
wildcard_constraints:
        dataset="|".join([re.escape(x) for x in DATASET]),
        sample="|".join([re.escape(x) for x in SAMPLE]),
        species="|".join([re.escape(x) for x in SPECIES]),
        gtf_version="|".join([re.escape(x) for x in GTF_VERSION]),
        fa_version="|".join([re.escape(x) for x in FA_VERSION])
 
## rule starIndex ##  Create star index if it does not exist yet
rule starIndex:
        input:
                fasta=config["project_path"]+"resources/{species}.{fa_version}.dna.primary_assembly.fa",
                gtf=config["project_path"]+"resources/{species}.{gtf_version}.gtf"
        output:
                directory(config["project_path"]+"resources/starIndex_{species}_{fa_version}_{gtf_version}")
        threads:
                20
        conda:
                "envs/DTPedia_bulkRNAseq.yaml"
        shell:
                """
                STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir {output} --genomeFastaFiles {input.fasta} --sjdbGTFfile {input.gtf}
                """
 
# function determine_species_fasta # function for determining fasta input of correct species to rule starIndex
def determine_species(wildcards,input):
        if fnmatch.fnmatch(input.read1, '*Homo_sapiens*'):
                return directory(expand(rules.starIndex.output, species = "Homo_sapiens", fa_version="GRCh38", gtf_version="GRCh38.106"))
        elif fnmatch.fnmatch(input.read1, '*Mus_musculus*'):
                return directory(expand(rules.starIndex.output, species="Mus_musculus", fa_version="GRCm39", gtf_version="GRCm39.107"))
 
 
rule star_1pass_alignment:
        input:
                read1=config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_1.fastq.gz",
                read2=config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_2.fastq.gz"
        output:
                bam=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_Aligned.sortedByCoord.out.bam",
                log=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_Log.final.out",
                sj_1pass=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_SJ.out.tab"
        params:
                index=determine_species,
                prefix=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_"
        threads:
                12
        conda:
                "envs/DTPedia_bulkRNAseq.yaml"
        shell:
                """
                STAR --runMode alignReads --genomeDir {params.index} --genomeLoad LoadAndKeep --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 10000000000 --limitGenomeGenerateRAM 20000000000 --readFilesIn {input.read1} {input.read2} --runThreadN {threads} --readFilesCommand gunzip -c --outFileNamePrefix {params.prefix}
                """

Output of $ulimit -a is

$ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2063155
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2063155
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

It seems like the STAR 1st pass rule is being run before the STAR index rule and therefore giving an error that it cannot find the genome index files in /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107/. It's not clear how to solve this, am I doing something wrong in snakemake, STAR or something else that I'm overlooking? Help would be very much appreciated, thanks!

EDIT 1: After modifying the code as @dariober suggested, the initial error disappeared. Now I'm getting the following error:

WorkflowError in line ...: Function did not return str or list of str

I've modified the function def determine_species according to the suggestions. It's getting closer but not fully there yet:

def determine_species(wildcards):
        read1 = config["project_path"]+"resources/raw_datasets/{wildcards.dataset}/{wildcards.sample}_{wildcards.species}_RNA-Seq_1.fastq.gz"
        Hs_index = "/DATA/m.venkatesan/DTPedia/resources/starIndex_Homo_sapiens_GRCh38_GRCh38.106"
        Ms_index = "/DATA/m.venkatesan/DTPedia/resources/starIndex_Mus_musculus_GRCm39_GRCm39.107"
        if fnmatch.fnmatch(read1, '*Homo_sapiens*'):
                return Hs_index
        elif fnmatch.fnmatch(read1, '*Mus_musculus*'):
                return Ms_index

STAR error in snakemake pipeline: "EXITING because of FATAL ERROR: could not open genome file"

Answers (1)

Related Questions

STAR error in snakemake pipeline: &quot;EXITING because of FATAL ERROR: could not open genome file&quot;

Answers (1)

Related Questions

STAR error in snakemake pipeline: "EXITING because of FATAL ERROR: could not open genome file"