user3704616
user3704616

Reputation: 7

Trying to find frequency of certain characters in a string but the results are very far off

I am trying to write a program which calculates and prints the GC content of a string of DNA(which is input through a txt file). That is, the percentage of G's and C's in a string of DNA. Here is my function for the GC percentage:

void updateGCCount(char s[], int *gc, int *at) {
   char c[MAXLENGTH];
   int i,GCcount,ATcount;
   float len,GCpercentage;
   GCcount=0;
   ATcount=0;
   for(i=0;c[i]!='\0';++i)
   {
     if(c[i]=='G' || c[i]=='C')
     {
       ++GCcount;
       *gc=GCcount;
     }
     if(c[i]=='A' || c[i]=='T')
     {
       ++ATcount;
       *at=ATcount;
     }
   }
   strcpy(c,s);
   len=strlen(c);
   GCpercentage=*gc/len;
   printf("GC-content: %.2f\n",GCpercentage);

}

This is my function definition, and the part which is supposed to correctly print the GC percentage is what I am not sure about. Below is my main program which utilizes the input text file.

#include "genomics.h"

int main(){
   char s[MAXLENGTH];
   int gc, at;
   scanf("%s",s);
   printf("Sequence  : %s\n",s);
   updateGCCount(s, &gc, &at);

   return 0;
}

Any help or advice on why I am not getting a correct value for the GCpercentage would be great. Thank you in advance

Upvotes: 0

Views: 99

Answers (3)

Norman Gray
Norman Gray

Reputation: 12514

That's a strongly un-idiomatic program. Consider the following.

#include <stdio.h>
#include <stdlib.h> /* for exit(3) */

float count_gc(const char* s) 
{

You have no need to pass information back via variables passed in by reference. Functions return values -- typically 'the answer'.

You're simply scanning the content of the argument string s, so there's no need to copy it anywhere.

As others have pointed out, you were scanning the contents of the array c[] before you copied anything in to it -- you were counting 'G' and 'C' in a (probably large) random block of memory. Keeping things simple avoids mistakes like that.

    int nvalid = 0;
    int gccount = 0;
    float result;

    for (; *s != '\0'; s++) {

Although the for loop you wrote isn't wrong, it's somewhat un-idiomatic. Here, we examine the character pointed to by the pointer s, and then increment the pointer, until we find ourselves pointing at the \0 that terminates the string. Yes, this means we 'lose' the initial value of the argument, but we don't need it after the loop, so that doesn't matter.

        switch (*s) {

A switch is a more natural construction here. You're looking for a small set of possible values that *s (that is, the character the pointer is currently pointing at) may have.

          case 'G':
          case 'C':
            nvalid++;
            gccount++;
            break;

          case 'A':
          case 'T':
            nvalid++;
            break;

          default:
            /* unexpected character -- ignore it */
            break;

Every switch statement should have a default clause -- one should always think of what's supposed to happen if none of the case clauses match. In this case, we just ignore this character.

        }
    }
    if (nvalid == 0) {
        fprintf(stderr, "No valid letters found!\n");
        result = 0.0;
    } else {
        /* Multiply by 1.0 to convert integer gccount to a float */
        result = 1.0*gccount / nvalid;
    }
    return result;

We return the result to the caller rather than printing it out inside the function. Functions shouldn't 'chatter', but leave all of the I/O in one place, typically leaving the main function (or something higher up) to look after that.

}

int main(int argc, char** argv) 
{
    if (argc != 2) {
        /* Give the user a hint on how to call the program */
        fprintf(stderr, "Usage: gcat <string>\n");
        exit(1);
    }

    printf("Sequence GC-content = %g\n", count_gc(argv[1]));
}

I run that with:

% cc -o gcat gcat.c
% ./gcat "GCAT ATx foo"         
Sequence GC-content = 0.333333
%

With C, it's very easy to tie yourself in knots, very quickly. Aim for simplicity always.

Upvotes: 0

Morad
Morad

Reputation: 2779

c is not initialize, so *gc and *at are not updated at all and they contain garbage..

here you should use s instead of c

for(i=0;c[i]!='\0';++i)
   {
     if(c[i]=='G' || c[i]=='C')
     {
       ++GCcount;
       *gc=GCcount;
     }
     if(c[i]=='A' || c[i]=='T')
     {
       ++ATcount;
       *at=ATcount;
     }
   }

Upvotes: 1

Ben
Ben

Reputation: 2133

You're doing your tests on char array "c":

char c[MAXLENGTH];
...
for(i=0;c[i]!='\0';++i)
{
 if(c[i]=='G' || c[i]=='C')
 {
   ++GCcount;
   *gc=GCcount;
 }
 if(c[i]=='A' || c[i]=='T')
 {
   ++ATcount;
   *at=ATcount;
 }

}

If should be on s, the array that you passed in. The c array is probably superflous, you should be able to get the length from s as well

Upvotes: 1

Related Questions