C - Prevent malloc() from writing over used memory

Question

Long story short, I'm trying to allocate memory dynamically to store the user input organized into two structs, word and sentence. Here are the struct definitions for them:

struct word {
    char* str; // Pointer to the character array
    int length; // The length of the word
    int posInSentence; // Where in the sentence this word is located
};

struct sentence {
    struct word* words; // Pointer to the words in this sentence
    int length; // The length of the sentence
    int line; // Which line this sentence is located in the collection
};

My program then calls takeInput() which, as the name suggests, takes the user's input:

struct sentence* takeInput(int* length) {
    int n = 0;
    struct sentence* out = malloc(sizeof(struct sentence));
    while (1) {
        out = (struct sentence*) realloc(out, (n + 1) * sizeof(struct sentence));
        char str[1000]; // The one execption where a defined array length is allowed
        printf("Enter your string: ");
        gets(str); // I know gets() is bad, but I can't be bothered to get scanf to work
        if (str[0] == '\0') { // Check to see if array is empty, since that means the end of user input
            break;
        }
        *(out + n) = splitString(str, (n + 1));
        printSentence((out + n), false);
        n++;
    }
    *length = n;
    return out;
}

This in turn calls splitString() to split the input into words and places them into a sentence:

struct sentence splitString(char* in, int line) {
    struct sentence out;
    char* split; // Pointer to the string tokens being split off with strtok()
    char delim = ' '; // Delimiter for strtok()
    
    int n = 0; // Used to count the number of words entered and for setting the position in sentence for each word
    struct word* w = (struct word *) malloc(sizeof(struct word)); // Pointer used for the array of words
    
    split = strtok(in, &delim); // Get the first token
    w->str = split;
    w->length = strlen(split);
    w->posInSentence = n + 1;
    n++;
    
    while (1) {
        w = (struct word *) realloc(w, (n + 1) * sizeof(struct word));
        split = strtok(NULL, &delim); // Get all tokens after
        if (split == NULL) {
            break; // Break if we've reached the end of the string
        }
        (w + n)->str = split;
        (w + n)->length = strlen(split);
        (w + n)->posInSentence = n + 1;
        n++;
    }
    
    out.words = w;
    out.length = n;
    out.line = line;
        
    return out;
}

The Issue

Printing out the sentences as they're entered is no issue:

Enter your string: Hello world
Hello
world
Enter your string: New Day
New
Day

The issue comes when you try to print out the list of sentences which is returned by takeInput(). As you can see by the GDB memory, malloc is writing over the same memory (address 0x7fffffffdb60) to store both times the user inputs a line (pay attention to address 0x55555555ae0 and 0x555555559c50, they both point to the words in each sentence)

GDB Memory Scan

Do you have any idea on how I can fix this so that it isn't writing over the same memory?

LSerni · Accepted Answer

Your problem lies here:

    char str[1000]; // The one execption where a defined array length is allowed

What do you do there? You allocate a block of memory on the stack. Always the same block. Then strtok operates on this block, and returns addresses in this block.

And you copy those addresses into *w, but they are always addresses in the same block that gets overwritten over and over at each call of the tokenizer. The words from the very last run will be correct, all the other pointers will point to... the same words, jumbled together.

You need to make it so (w + n)->str points to fresh memory, and to do that you can just

// Make str point to a COPY of split.
(w + n)->str = strdup(split);

As @Barmar noted, if you then deallocate your structures, you should also take care of freeing those pointers one by one (depending on the particular malloc library, you might find it useful to free them in reverse order to ensure memory doesn't get fragmented).

Another, less wasteful possibility is to only run one malloc, to duplicate the whole still untokenized line (then the malloc initial pointer will always be the one pointed by w[0]->str) and then unleash strtok on the copied string.

The memory will then look like (# are zeroes, MALLOCHDR is the malloc bookkeeping overhead)

 [MALLOCHDR]This_is_a_sentence_[SLACK]
 w[0]->str -^
 w[1]->str ------^
 w[2]->str ---------^
 w[3]->str -----------^

instead of one strdup (and therefore, malloc) per word:

 [MALLOCHDR]This_[SLACK][MALLOCHDR]is_[SLACK][MALLOCHDR]a_[SLACK}...
 w[0]->str -^
 w[1]->str ------------------------^
 w[2]->str ---------------------------------------------^

(the SLACK bytes are due to the fact that malloc usually actually allocates memory in chunks somewhat larger than the amount you request - I believe they're usually at least multiples of 8 bytes, maybe even 16. This kind of details is usually poorly documented and implementation dependent, but you'll want to be aware of them when you browse through 1,000 5-bytes words, and you wonder how you could occupy half a megabyte of RAM - or why after one hour of malloc'ing and free'ing the same bytes over and over, you abruptly somehow run out of memory. I've done both, more than once).

C - Prevent malloc() from writing over used memory

The Issue

Answers (1)

Related Questions