lynkyra
lynkyra

Reputation: 59

Iterate through files in a directory, create output files, linux

I am trying to iterate through every file in a specific directory (called sequences), and perform two functions on each file. I know that the functions (the 'blastp' and 'cat' lines) work, since I can run them on individual files. Ordinarily I would have a specific file name as the query, output, etc., but I'm trying to use a variable so the loop can work through many files.

(Disclaimer: I am new to coding.) I believe that I am running into serious problems with trying to use my file names within my functions. As it is, my code will execute, but it creates a bunch of extra unintended files. This is what I intend for my script to do:

Line 1: Iterate through every file in my "sequences" directory. (All of which end with ".fa", if that is helpful.)

Line 3: Recognize the filename as a variable. (I know, I know, I think I've done this horribly wrong.)

Line 4: Run the blastp function using the file name as the argument for the "query" flag, always use "database.faa" as the argument for the "db" flag, and output the result in a new file that is has the same name as the initial file, but with ".txt" at the end.

Line 5: Output parts of the output file from line 4 into a new file that has the same name as the initial file, but with "_top_hits.txt" at the end.

for sequence in ./sequences/{.,}*;
    do
            echo "$sequence";
            blastp -query $sequence -db database.faa -out ${sequence}.txt -evalue 1e-10 -outfmt 7
            cat ${sequence}.txt | awk '/hits found/{getline;print}' | grep -v "#">${sequence}_top_hits.txt
    done

When I ran this code, it gave me six new files derived from each file in the directory (and they were all in the same directory - I'd prefer to have them all in their own folders. How can I do that?). They were all empty. Their suffixes were, ".txt", ".txt.txt", ".txt_top_hits.txt", "_top_hits.txt", "_top_hits.txt.txt", and "_top_hits.txt_top_hits.txt".

If I can provide any further information to clarify anything, please let me know.

Upvotes: 3

Views: 1998

Answers (3)

LucasCortes
LucasCortes

Reputation: 29

You should be using *.fa if you only want files with a .fa ending. Additionally, if you want to redirect your output to new folders you need to create those directories somewhere using

mkdir 'folder_name'

then you need to redirect your -o outputs to those files, something like this

'command' -o /path/to/output/folder

To help you test this script out, you can run each line one by one to test them. You need to make sure each line works by itself before combining.

One last thing, be careful with your use of colons, it should look something like this:

for filename in *.fa; do 'command'; done 

Upvotes: 0

bli
bli

Reputation: 8184

I can propose you the following improvements:

for fasta_file in ./sequences/*.fa # ";" is not necessary if you already have a new line for your "do"
do
    # ${variable%something} is the part of $variable
    # before the string "something"
    # basename path/to/file is the name of the file
    # without the full path
    # $(some command) allows you to use the result of the command as a string
    # Combining the above, we can form a string based on our fasta file
    # This string can be useful to name stuff in a clean manner later
    sequence_name=$(basename ${fasta_file%.fa})
    echo ${sequence_name}
    # Create a directory for the results for this sequence
    # -p option avoids a failure in case the directory already exists
    mkdir -p ${sequence_name}
    # Define the name of the file for the results
    # (including our previously created directory in its path)
    blast_results=${sequence_name}/${sequence_name}_blast.txt
    blastp -query ${fasta_file} -db database.faa \
        -out ${blast_results} \
        -evalue 1e-10 -outfmt 7
    # Define a file name for the top hits
    top_hits=${sequence_name}/${sequence_name}_top_hits.txt
    # alternatively, using "%"
    #top_hits=${blast_results%_blast.txt}_top_hits.txt
    # No need to cat: awk can take a file as argument
    awk '/hits found/{getline;print}' ${blast_results} \
        | grep -v "#" > ${sequence_name}_top_hits.txt
done

I made more intermediate variables, with (hopefully) meaningful names. I used \ to escape line ends and allow putting commands in several lines. I hope this improves code readability.

I haven't tested. There may be typos.

Upvotes: 0

aardvarkk
aardvarkk

Reputation: 15996

If you're only interested in *.fa files I would limit your input to only those matching files like this:

for sequence in sequences/*.fa; do

Upvotes: 3

Related Questions