katiesf0106
katiesf0106

Reputation: 1

C - saving one file to two separate arrays and printing each element

I have a file of DNA sequences and an associated IDs and I'm trying to save the even lines (the IDs) to one array and the odd lines (the sequences) to another. Then I want to compare all of the sequences with each other to find the unique sequences. For example is Seq A is AGTCGAT and Seq B is TCG, Seq B is not unique. I want to save the unique sequences and their IDs to an output file and id the sequences are not unique, only save the ID to the output file and print "Deleting sequence with ID: " to the console. I'm pretty much done but Im running into a few problems. I tried printing out the two separate arrays, sequences[] and headers[], but for some reason, they only contain two out of the 5 strings (the file has 5 IDs and 5 headers). And then the information isn't printing out to the screen the way it's supposed to.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(){

  int total_seq = 20000;
  char seq[900];
  char** headers;
  char** sequences;
  int sequence_size = 0;

  headers = malloc(total_seq * sizeof(char*));
  sequences = malloc(total_seq * sizeof(char*));

  int index;
  for(index = 0; index < total_seq; index++){
    headers[index] = malloc(900 * sizeof(char));
    sequences[index] = malloc(900 * sizeof(char));
  }

  FILE *dna_file;
  FILE *new_file;
  dna_file = fopen("inabc.fasta", "r");
  new_file = fopen("output.fasta", "w");

  if (dna_file == NULL){
    printf("Error");
    return 0;
  }

  int i = 0;
  int j = 0;
  while(fgets(seq, sizeof seq, dna_file)){
    if(i%2 == 0){
      strcpy(headers[i/2], seq);
      i++;
    }
    else{
      strcpy(sequences[i/2], seq);
      i++;
    }
  }


  fclose(dna_file);
  sequence_size = i/2;

  char* result;
  for(i=0; i < sequence_size; i++){
    for(j=0; j < sequence_size; j++){
      if(i==j){
        continue;
    }
      result = strstr(sequences[j], sequences[i]);
      if(result== NULL){
        fprintf(new_file,"%s", headers[i]);
        fprintf(new_file,"%s", sequences[i]);
      }
      else{
        printf("Deleting sequence with id: %s \n", headers[i]);
        printf(sequences[i]);
        fprintf(new_file,"%s", headers[i]);
      }
    }
  }

The sample file inabc.fasta is short but the actual file I'm working with is very long, which is why I've used malloc. Any help would be appreciated!

EDIT: The sample input file inabc.fasta:

cat inabc.fasta

> id1 header1
abcd
> id2 header2
deghj
> id3 header3
defghijkabcd
> id4 header4
abcd
> id5 header5
xcvbnnmlll

So for this sample, sequences 1 and 4 will not be saved to the output file

Upvotes: 0

Views: 63

Answers (2)

David C. Rankin
David C. Rankin

Reputation: 84579

In addition to the other comments, there are a few logic errors in your output routine you need to correct. Below, I have left your code in comments, so you can follow the changes and additions made.

There are several ways you can approach memory management a bit more efficiently, and provide a way to cleanly iterate over your data without tracking counters throughout your code. Specifically, when you allocate your array of pointers-to-pointer-to-char, use calloc instead of malloc so that your pointers are initialized to zero/NULL. This allows you to easily iterate over only those pointers that have been assigned.

There is no need to allocate 20000 900 char arrays (times 2) before reading your data. Allocate your pointers (or start with a smaller number of pointers say 256 and realloc as needed), then simply allocate for each element in headers and sequences as needed within your read loop. Further, instead of allocating 1800 chars (900 * 2) every time you add an element to headers and sequences, just allocate the memory required to hold the data. This can make a huge difference. For example, you allocate 20000 * 900 * 2 = 36000000 bytes (36M) before you start reading this small set of sample data. Even allocating all 20000 pointers, allocating memory as needed for this sample data, limits memory usage to 321,246 bytes (less that 1% of 36M)

The logic in your write loop will not work. You must move your write of the data outside of the inner loop. Otherwise you have no way of testing whether to Delete a duplicate entry. Further testing result does not provide a way to skip duplicates. result changes with every iteration of the inner loop. You need to both test and set a flag that will control whether or not to delete the duplicate once you leave the inner loop.

Finally, since you are allocating memory dynamically, you are responsible for tracking the memory allocated and freeing the memory when no longer needed. Allocating your array of pointers with calloc makes freeing the memory in use a snap.

Take a look at the changes and additions I'm made to your code. Understand the changes and let me know if you have any questions. Note: there are many checks omitted for sake of not cluttering the code. You should at minimum make sure you do not exceed the 20000 pointers allocated when run on a full dataset, and realloc as needed. You should also check that strdup succeeded (it's allocating memory), although you get some assurance comparing the headers and sequences index count. I'm sure there are many more that make sense. Good luck.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXSEQ 20000
#define SZSEQ 900

int main ()
{

    int total_seq = MAXSEQ;     /* initialize all variables */
    char seq[SZSEQ] = {0};
    char **headers = NULL;      /* traditionally variables  */
    char **sequences = NULL;    /* declared at beginning    */
    // char *result = NULL;
    // int sequence_size = 0;
    size_t len = 0;
    int hidx = 0;
    int sidx = 0;
    // int idx = 0;      /* (see alternative in fgets loop) */
    int i = 0;
    int j = 0;
    int del = 0;

    /* calloc initilizes to 0 & allows iteration on addresses */
    headers = calloc (total_seq, sizeof (*headers));
    sequences = calloc (total_seq, sizeof (*sequences));

    /* allocate as needed if possible - see read loop  */
//     for (index = 0; index < total_seq; index++) {
//         headers[index] = malloc (900 * sizeof (char));
//         sequences[index] = malloc (900 * sizeof (char));
//     }

    FILE *dna_file = NULL;
    FILE *new_file = NULL;
    dna_file = fopen ("inabc.fasta", "r");
    new_file = fopen ("output.fasta", "w+");    /* create if not existing "w+"  */

    if (!dna_file || !new_file) {
        fprintf (stderr, "Error: file open failed.\n");
        return 1;                               /* 1 indicates error condition  */
    }

    while (fgets (seq, sizeof (seq), dna_file)) /* read dna_file & separate     */
    {
        len = strlen (seq);                     /* strip newline from seq end   */
        if (seq[len-1] == '\n')                 /* it's never good to leave \n  */
            seq[--len] = 0;                     /* scattered through your data  */

        /* if header line has '>' as first char -- use it!  */
        if (*seq == '>')
            headers[hidx++] = strdup (seq);     /* strdup allocates             */
        else
            sequences[sidx++] = strdup (seq);

        /* alternative using counter if no '>'  */
//         if (idx % 2 == 0)
//             headers[hidx++] = strdup (seq);
//         else
//             sequences[sidx++] = strdup (seq);
//         idx++
    }

    fclose (dna_file);

    if (hidx != sidx)
        fprintf (stderr, "warning: hidx:sidx (%d:%d) differ.\n", hidx, sidx);

//     sequence_size = (hidx>sidx) ? sidx : hidx;  /* protect against unequal read */
//     
//     for (i = 0; i < sequence_size; i++) {
//         for (j = 0; i < sequence_size; i++) {
//             if (i == j) {
//                 continue;
//             }
//             result = strstr (sequences[j], sequences[i]);
//             if (result == NULL) {
//                 fprintf (new_file, "%s", headers[i]);
//                 fprintf (new_file, "%s", sequences[i]);
//             } else {
//                 printf ("Deleting sequence with id: %s \n", headers[i]);
//                 printf (sequences[i]);
//                 fprintf (new_file, "%s", headers[i]);
//             }
//         }
//     }

    /* by using calloc, all pointers except those assigned are NULL */
    while (sequences[i])    /* testing while (sequences[i] != NULL) */
    {
        j = 0;
        del = 0;
        while (sequences[j])
        {
            if (i == j)
            {
                j++;
                continue;
            }

            if (strstr (sequences[j], sequences[i]))    /* set delete flag  */
            {
                del = 1;
                break;
            }
            j++;
        }

        if (del) {
            printf ("Deleting id: '%s' with seq: '%s' \n", headers[i], sequences[i]);
            // printf (sequences[i]);
            fprintf (new_file, "%s\n", headers[i]);
        } else {
            fprintf (new_file, "%s\n", headers[i]);
            fprintf (new_file, "%s\n", sequences[i]);
        }
        i++;
    }

    fclose (new_file);

    /* free allocated memory - same simple iteration */
    i = 0;
    while (headers[i])
        free (headers[i++]);      /* free strings allocated by strdup */
    if (headers) free (headers);  /* free the array of pointers       */

    i = 0;
    while (sequences[i])
        free (sequences[i++]);
    if (sequences) free (sequences);

    return 0;
}

output:

$ ./bin/dnaio
Deleting id: '> id1 header1' with seq: 'abcd'
Deleting id: '> id4 header4' with seq: 'abcd'

output.fasta:

$ cat output.fasta
> id1 header1
> id2 header2
deghj
> id3 header3
defghijkabcd
> id4 header4
> id5 header5
xcvbnnmlll

memory allocation/free verification:

==21608== HEAP SUMMARY:
==21608==     in use at exit: 0 bytes in 0 blocks
==21608==   total heap usage: 14 allocs, 14 frees, 321,246 bytes allocated
==21608==
==21608== All heap blocks were freed -- no leaks are possible

Upvotes: 0

Crowman
Crowman

Reputation: 25926

This:

while( fgets(seq, sizeof seq, dna_file) ) {
    if( i % 2 == 0 ){
        strcpy(headers[i], seq);
        i++;
    }
    else {
        strcpy(sequences[i-1], seq);
        i++;
    }
}

is going to skip every other element in your arrays:

  1. When i == 0, it'll store in headers[0]
  2. When i == 1, it'll store in sequences[0]
  3. When i == 2, it'll store in headers[2]
  4. When i == 3, it'll store in sequences[2]

and so on.

Then you do:

sequence_size = i/2;

so if you loop sequence_size times, you'll only make it half way through the piece of the array you've written to, and every other element you print will be uninitialized. This is why you're only printing half the elements (if you have 5 elements, then i / 2 == 2, and you'll only see 2), and why it "isn't printing out to the screen the way it's supposed to".

You'll be better off just using either two separate counters when you read in your input, and a separate variable to store whether you're on an odd or even line of input.

For instance:

int i = 0, j = 0, even = 1;
while( fgets(seq, sizeof seq, dna_file) ) {
    if( even ){
        strcpy(headers[i++], seq);
        even = 0;
    }
    else {
        strcpy(sequences[j++], seq);
        even = 1;
    }
}

Here it's better to have two variables, since if you read in an odd number of lines, your two arrays will contain different numbers of read elements.

Upvotes: 1

Related Questions