SemperCallide
SemperCallide

Reputation: 2080

Why is this command within my code giving different result than the same command in terminal?

**Edit: Okay, so I've tried implementing everyone's advice so far.

-I've added quotes around each variable "$1" and "$codon" to avoid whitespace.

-I've added the -ioc flag to grep to avoid caps.

-I tried using tr -d' ', however that leads to a runtime error because it says -d' ' is an invalid option.

Unfortunately I am still seeing the same problem. Or a different problem, which is that it tells me that every codon appears exactly once. Which is a different kind of wrong.

Thanks for everything so far - I'm still open to new ideas. I've updated my code below.**

I have this bash script that is supposed to count all permutations of (A C G T) in a given file.

One line of the script is not giving me the desired result and I don't know why - especially because I can enter the exact same line of code in the command prompt and get the desired result.

The line, executed in the command prompt, is:

cat dnafile | grep -o GCT | wc -l

This line tells me how many times the regular expression "GCT" appears in the file dnafile. When I run this command the result I get is 10 (which is accurate).

In the code itself, I run a modified version of the same command:

cat $1 | grep -o $codon | wc -l

Where $1 is the file name, and $codon is the 3-letter combination. When I run this from within the program, the answer I get is ALWAYS 0 (which is decidedly not accurate).

I was hoping one of you fine gents could enlighten this lost soul as to why this is not working as expected.

Thank you very, very much!

My code:

#!/bin/bash
#countcodons <dnafile> counts occurances of each codon in sequence contained within <dnafile> 


if [[ $# != 1 ]] 
    then echo "Format is: countcodons <dnafile>"
    exit
fi

nucleos=(a c g t)
allCods=()

#mix and match nucleotides to create all codons

for x in {0..3}
do 
    for y in {0..3}
    do 
        for z in {0..3}
        do 
            perm=${nucleos[$x]}${nucleos[$y]}${nucleos[$z]}     
            allCods=("${allCods[@]}" "$perm") 
        done
    done
done


#for each codon, use grep to count # of occurances in file

len=${#allCods[*]} 
for (( n=0; n<len; n++ ))
do
    codon=${allCods[$n]}
    occs=`cat "$1" | grep -ioc "$codon" | wc -l`

    echo "$codon appears: $occs"    
#   if (( $occs > 0 ))
#   then
#       echo "$codon : $occs"
#   fi
done

exit

Upvotes: 0

Views: 202

Answers (3)

Barmar
Barmar

Reputation: 782437

Try:

occs=`cat $1 | grep -o $codon | wc -l | tr -d ' '`

The problem is that wc indents the output, so $occs has a bunch of spaces at the beginning.

Upvotes: 0

Ed Morton
Ed Morton

Reputation: 204498

You've got your logic backwards - you shouldn't have to read your input file once for every codon, you should only have to read it once and check each line for every codon.

You didn't supply any sample input or expected output so it's untested but something like this is the right approach:

awk '
BEGIN {
    nucleosStr="a c g t"
    split(nucleosStr,nucleos)

    #mix and match nucleotides to create all codons
    for (x in nucleos) {
        for (y in nucleos) {
            for (z in nucleos) {
                perm = nucleos[x] nucleos[y] nucleos[z]    
                allCodsStr = allCodsStr (allCodsStr?" ":"") perm
            }
        }
    }

    split(allCodsStr,allCods)
}
{
    #for each codon, count # of occurances in file
    for (n in allCods) {
        codon = allCods[n]
        if ( tolower($0) ~ codon ) {
            occs[n]++
        }
    }
}

END {
    for (n in allCods) {
        printf "%s appears: %d\n", allCods[n], occs[n]
    }
}
' "$1"

I expect you'll see a huge performance improvement with that approach if your file is moderately large.

Upvotes: 0

Diego Basch
Diego Basch

Reputation: 13079

You're generating your sequences in lowercase. Your code greps for gct, not GCT. You want to add the -i switch to grep. Try:

occs=`grep -ioc $codon $1`

Upvotes: 3

Related Questions