Reputation: 359

How to Delete Duplicate Elements from Dynamically Allocated String Array in C

I have created a program in C that reads in a word file and counts how many words are in that file, along with how many times each word occurs.

When I run it through Valgrind I either get too many bytes lost or a Segmentation Fault.

How can I remove a duplicate element from a dynamically allocated array and free the memory as well?

Gist: wordcount.c

int tokenize(Dictionary **dictionary, char *words, int total_words)
{

    char *delim = " .,?!:;/\"\'\n\t";
    char **temp = malloc(sizeof(char) * strlen(words) + 1);
    char *token = strtok(words, delim);

    *dictionary = (Dictionary*)malloc(sizeof(Dictionary) * total_words);

    int count = 1, index = 0;

    while (token != NULL)
    {
        temp[index] = (char*)malloc(sizeof(char) * strlen(token) + 1);
        strcpy(temp[index], token);

        token = strtok(NULL, delim);

        index++;
    }

    for (int i = 0; i < total_words; ++i)
    {
        for (int j = i + 1; j < total_words; ++j)
        {
            if (strcmp(temp[i], temp[j]) == 0) // <------ segmentation fault occurs here
            {
                count++;

                for (int k = j; k < total_words; ++k) // <----- loop to remove duplicates
                    temp[k] = temp[k+1];

                total_words--;
                j--;
            }
        }


        int length = strlen(temp[i]) + 1;
        (*dictionary)[i].word = (char*)malloc(sizeof(char) * length);

        strcpy((*dictionary)[i].word, temp[i]);
        (*dictionary)[i].count = count;

        count = 1;
    }

    free(temp);
    return 0;
}

Thanks in advance.

Upvotes: 0

Answers (2)

David C. Rankin

Reputation: 84559

Without A Minimal, Complete, and Verifiable example, there is no guarantee that additional problems do not originate elsewhere in your code, but the following need careful attention:

    char **temp = malloc(sizeof(char) * strlen(words) + 1);

Above you are allocating pointers, not characters, so your allocation is too small by a factor of sizeof (char*) / sizeof (char) (typically 8 on a 64-bit system). To prevent such problems, size the allocation with sizeof *thepointer and you will always have the correct size, e.g.

    char **temp = malloc (sizeof *temp * (strlen(words) + 1));

(Unless you plan on providing a sentinel NULL as the final pointer, the + 1 is unnecessary. Note the parentheses: without them the + 1 adds a single byte rather than room for one more pointer. You must also validate the return, see below.)

    *dictionary = (Dictionary*)malloc(sizeof(Dictionary) * total_words);

There is no need to cast the return of malloc, it is unnecessary. See: Do I cast the result of malloc? Further, if *dictionary was previously allocated elsewhere, the allocation above creates a memory leak because you lose the reference to the original pointer. If it has been previously allocated, you need realloc, not malloc. And if it wasn't allocated, a better way of writing it would be:

    *dictionary = malloc (sizeof **dictionary * total_words);

You must also validate that the allocation succeeds before attempting to use the block of memory, e.g.

    if (! *dictionary) {
        perror ("malloc - *dictionary");
        exit (EXIT_FAILURE);
    }
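
For the realloc case mentioned above, a minimal sketch might look like the following. The Dictionary layout is assumed from the word and count members used in the question, and grow_dictionary is a hypothetical helper, not something from the original code. The point it illustrates is assigning the result of realloc to a temporary pointer, so the original block is not lost if realloc fails. It assumes new_count is greater than zero.

```c
#include <stdlib.h>

/* Assumed layout, mirroring the word/count members used in the question. */
typedef struct {
    char *word;
    int count;
} Dictionary;

/* Hypothetical helper: resize *dict to hold new_count entries.
 * Returns 0 on success, -1 on failure (the old block stays valid). */
int grow_dictionary (Dictionary **dict, size_t new_count)
{
    /* use a temporary: on failure realloc returns NULL and leaves the
     * original block untouched, so *dict must not be overwritten */
    Dictionary *tmp = realloc (*dict, sizeof **dict * new_count);

    if (!tmp)
        return -1;

    *dict = tmp;
    return 0;
}
```

Calling it with *dict set to NULL behaves like a plain malloc, so the same helper can cover both the first allocation and later growth.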

In:

        temp[index] = (char*)malloc(sizeof(char) * strlen(token) + 1);

sizeof(char) is always 1 and can be omitted. Better written as:

        temp[index] = malloc (strlen(token) + 1);

or better, allocate and validate in a single block:

        if (!(temp[index] = malloc (strlen(token) + 1))) {
            perror ("malloc - temp[index]");
            exit (EXIT_FAILURE);
        }

then

        strcpy(temp[index++], token);

Next, while total_words may be equal to the words in temp, you have only validated that you have index number of words. That combined with your original allocation times sizeof (char) instead of sizeof (char *), makes it no wonder there can be segfaults where you attempt to iterate over your list of pointers in temp. Better:

    for (int i = 0; i < index; ++i)
    {
        for (int j = i + 1; j < index; ++j)

(The same applies to your k loop, which also reads temp[k+1] one element past the last word on its final iteration.) Additionally, since you have allocated each temp[index], when you shuffle pointers with temp[k] = temp[k+1]; the first assignment overwrites the pointer held in temp[j], the duplicate being removed, without freeing it. That leaks memory for every duplicate you remove. The duplicate should be freed before the first assignment is made.

While you are updating total_words--, there still to this point has never been a validation that index == total_words, and in the event they are not, you can have no confidence in total_words or that you won't segfault attempting to iterate over uninitialized pointers as the result.
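
As a rough sketch of freeing a duplicate before closing the gap, something along these lines could work. Both remove_at and remove_duplicates are hypothetical helper names, and the code assumes every element of words was allocated with malloc and that n is the number of pointers actually filled in.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: remove words[pos] from an array of *n malloc'd
 * strings, free its storage, and close the gap. */
void remove_at (char **words, size_t *n, size_t pos)
{
    if (pos >= *n)
        return;

    free (words[pos]);  /* release the string before its pointer is lost */

    /* shift the remaining (*n - pos - 1) pointers down by one */
    memmove (&words[pos], &words[pos + 1], (*n - pos - 1) * sizeof *words);

    (*n)--;
}

/* Remove every later duplicate of each word; returns the new count. */
size_t remove_duplicates (char **words, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; ) {
            if (strcmp (words[i], words[j]) == 0)
                remove_at (words, &n, j);   /* stay on j: a new element moved in */
            else
                j++;
        }
    }

    return n;
}
```

Using memmove with an explicit element count avoids reading one element past the end, which is what the hand-written k loop does on its last iteration.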

The rest appears workable, but after the changes above are made, you should ensure that there are no additional changes needed. Look things over and let me know if you need additional help. (And with an MCVE, I'm happy to help further.)

Additional Problems

I apologize for the delay, real-world called -- and this took a lot longer than anticipated, because what you have is an awkward slow-motion logical train-wreck. First and foremost, while there is nothing wrong with reading an entire text file into a buffer with fread -- the buffer is NOT nul-terminated and therefore cannot be used with any function expecting a string. Yes, strtok, strcpy or any other string function will read past the end of word_data looking for the nul-terminating character (well out into memory you don't own), resulting in a SegFault.

Your various scattered +1 tacked onto your malloc allocations now make a little more sense, as it appears you were looking for where you needed to add an additional character to make sure you could nul-terminate word_data, but couldn't quite figure out where it went. (don't worry, I straightened that out for you, but it is a big hint that you are probably going about this in the wrong way -- reading with POSIX getline or fgets is probably a better approach than the file-at-once for this type of text processing)
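
To illustrate the line-at-a-time alternative, here is a minimal sketch using fgets. The count_tokens helper and the MAXLINE limit are assumptions made for the example, not part of the original program. A line longer than MAXLINE - 1 characters would be read in pieces, which could split a word in two, so treat this as a starting point only.

```c
#include <stdio.h>
#include <string.h>

#define MAXLINE 1024    /* assumed upper bound on line length */

/* Hypothetical helper: count the tokens in fp, reading a line at a time. */
size_t count_tokens (FILE *fp, const char *delim)
{
    char line[MAXLINE];
    size_t n = 0;

    /* fgets always nul-terminates what it reads, so strtok is safe here */
    while (fgets (line, sizeof line, fp)) {
        for (char *tok = strtok (line, delim); tok; tok = strtok (NULL, delim))
            n++;
    }

    return n;
}
```

Because each line is a proper string as soon as fgets returns, there is no separate step where the buffer has to be nul-terminated by hand.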

That is, literally, just the tip of the iceberg in the problems encountered in your code. As hinted at earlier, in tokenize, you failed to validate that index equals total_words. This ends up being important given your choice of delim, which includes the ASCII apostrophe (or single-quote). This causes your index to exceed the word_count any time a possessive or contraction is encountered in the buffer (e.g. "can't" is split into "can" and "t", "Peter's" is split into "Peter" and "s", etc.). You will have to decide how you want to resolve this; I have simply removed the single quote for now.
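
The effect of the apostrophe in the delimiter set is easy to check with a small sketch. token_count is a hypothetical helper written for this illustration; it tokenizes a copy of the text because strtok modifies the string it is given.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: number of tokens strtok finds in text for delim.
 * Works on a copy because strtok writes nul bytes into its argument. */
size_t token_count (const char *text, const char *delim)
{
    size_t n = 0;
    char *copy = malloc (strlen (text) + 1);

    if (!copy)
        return 0;
    strcpy (copy, text);

    for (char *tok = strtok (copy, delim); tok; tok = strtok (NULL, delim))
        n++;

    free (copy);
    return n;
}
```

With the text "can't stop", token_count gives 2 when delim is " " and 3 when delim is " '", while a whitespace-based word count sees 2 words in both cases. That mismatch is what lets index run past total_words.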

Your logic in both tokenize and count_words was difficult to follow, and just wrong in some aspects, and your return type (void) for read_file provided absolutely no way to indicate success (or failure) within. Always choose a return type that provides meaningful information from which you can determine whether a critical function has succeeded or failed (reading your data qualifies as critical).

If a function provides a return -- use it. This applies to all functions that can fail (including functions like fseek).

Returning 0 from tokenize misses the return of the number of words (allocated structs) in dictionary, leaving you unable to properly free the information and leaving you to guess at some number to display (e.g. for (int i = 0; i < 333; ++i) in main()). You need to track the number of dictionary structs and member word that are allocated in tokenize (keep an index, say dindex). Then returning dindex to main() (assigned to hello in your code) provides the information you need to iterate over the structs in main() to output your information, as well as to free each allocated word before freeing the pointers.

If you don't have an accurate count of the number of allocated dictionary structs back in main(), you have failed in the two responsibilities you have regarding any block of memory allocated: (1) always preserve a pointer to the starting address for the block of memory so, (2) it can be freed when it is no longer needed. If you don't know how many blocks there are, then you haven't done (1) and can't do (2).

This is a nit about style, and while not an error, the standard coding style for C avoids the use of Initialcaps, camelCase or MixedCase variable names in favor of all lower-case while reserving upper-case names for use with macros and constants. It is a matter of style -- so it is completely up to you, but failing to follow it can lead to the wrong first impression in some circles.

Rather than carry on for another handful of paragraphs, I've reworked your example for you and added a few comments inline. Go through it. I haven't punishingly tested it for all corner cases, but it should be a sound base to build from. You will note in going through it that your count_words and tokenize have been simplified. Try to understand why what was done was done, and ask if you have any questions:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <errno.h>

typedef struct{
    char *word;
    int count;
} dictionary_t;

char *read_file (FILE *file, char **words, size_t *length)
{
    size_t size = *length = 0;

    if (fseek (file, 0, SEEK_END) == -1) {
        perror ("fseek SEEK_END");
        return NULL;
    }
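    long pos = ftell (file);    /* ftell returns -1L on failure */
    if (pos == -1L) {
        perror ("ftell");
        return NULL;
    }
    size = (size_t)pos;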

    if (fseek (file, 0, SEEK_SET) == -1) {
        perror ("fseek SEEK_SET");
        return NULL;
    }

    /* +1 needed to nul-terminate buffer to pass to strtok */
    if (!(*words = malloc (size + 1))) {
        perror ("malloc - size");
        return NULL;
    }

    if (fread (*words, 1, size, file) != size) {
        perror ("fread words");
        free (*words);
        return NULL;
    }

    *length = size;
    (*words)[*length] = 0;  /* nul-terminate buffer - critical */

    return *words;
}

int tokenize (dictionary_t **dictionary, char *words, int total_words)
{
    // char *delim = " .,?!:;/\"\'\n\t";    /* don't split on apostrophes */
    char *delim = " .,?!:;/\"\n\t";
    char **temp = malloc (sizeof *temp * total_words);
    char *token = strtok(words, delim);
    int index = 0, dindex = 0;

    if (!temp) {
        perror ("malloc temp");
        return -1;
    }

    if (!(*dictionary = malloc (sizeof **dictionary * total_words))) {
        perror ("malloc - dictionary");
        free (temp);    /* don't leak temp on failure */
        return -1;
    }

    while (token != NULL && index < total_words)    /* never write past temp */
    {
        if (!(temp[index] = malloc (strlen (token) + 1))) {
            perror ("malloc - temp[index]");
            exit (EXIT_FAILURE);
        }
        strcpy(temp[index++], token);

        token = strtok (NULL, delim);
    }

    if (token != NULL || total_words != index) { /* validate total_words = index */
        fprintf (stderr, "warning: token count differs from total_words (%d), "
                "using %d\n", total_words, index);
        total_words = index;    /* only temp[0..index-1] are initialized */
    }


    for (int i = 0; i < total_words; i++) {
        int found = 0, j = 0;
        for (; j < dindex; j++)
            if (strcmp((*dictionary)[j].word, temp[i]) == 0) {
                found = 1;
                break;
            }
        if (!found) {
            if (!((*dictionary)[dindex].word = malloc (strlen (temp[i]) + 1))) {
                perror ("malloc (*dictionary)[dindex].word");
                exit (EXIT_FAILURE);
            }
            strcpy ((*dictionary)[dindex].word, temp[i]);
            (*dictionary)[dindex++].count = 1;
        }
        else
            (*dictionary)[j].count++;
    }

    for (int i = 0; i < total_words; i++)
        free (temp[i]);     /* you must free storage for words */
    free (temp);            /* before freeing pointers */

    return dindex;
}

int count_words (char *words, size_t length)
{
    int count = 0;
    char previous_char = ' ';

    while (length--) {
        if (isspace (previous_char) && !isspace (*words))
            count++;
        previous_char = *words++;
    }

    return count;
}

int main (int argc, char **argv)
{
    char *word_data = NULL;
    int word_count, hello;
    size_t length = 0;
    dictionary_t *dictionary = NULL;
    FILE *input = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!input) {   /* validate file open for reading */
        fprintf (stderr, "error: file open failed '%s'.\n", argv[1]);
        return 1;
    }

    if (!read_file (input, &word_data, &length)) {
        fprintf (stderr, "error: file_read failed.\n");
        return 1;
    }
    if (input != stdin) fclose (input); /* close file if not stdin */

    word_count = count_words (word_data, length);
    printf ("wordct: %d\n", word_count);

    /* number of dictionary words returned in hello */
    if ((hello = tokenize (&dictionary, word_data, word_count)) <= 0) {
        fprintf (stderr, "error: no words or tokenize failed.\n");
        return 1;
    }

    for (int i = 0; i < hello; ++i) {
        printf("%-16s : %d\n", dictionary[i].word, dictionary[i].count);
        free (dictionary[i].word);  /* you must free word storage */
    }
    free (dictionary);  /* free pointers */

    free (word_data);   /* free buffer */

    return 0;
}

Let me know if you have further questions.

Upvotes: 3

pgngp

Reputation: 1562

There are a few things that you need to do to make your code work:

Fix the memory allocation of temp by replacing sizeof(char) with sizeof(char *) like so:

    char **temp = malloc(sizeof(char *) * strlen(words) + 1);

Fix the memory allocation of dictionary by replacing sizeof(Dictionary) with sizeof(Dictionary *):

    *dictionary = (Dictionary*)malloc(sizeof(Dictionary *) * (*total_words));

Pass the address of word_count when calling tokenize:

    int hello = tokenize(&dictionary, word_data, &word_count);

Replace all occurrences of total_words in the tokenize function with (*total_words). In the tokenize function signature, you can replace int total_words with int *total_words.

You should also replace the hard-coded value of 333 in your for loop in the main function with word_count.

After you make these changes, your code should work as expected. I was able to run it successfully with these changes.

Upvotes: 1
