Binary searching a dictionary of words

Question

In this code:

I read the content of the file ~/usr/share/dict/word and stored them in array.
Then started to do a binary search algorithm on this array but the problem is after passing the array to the binary search function in line 62 and try to compare it to the key in the binary_search(string* dictionary, string key) method.
I found out it was comparing the key to this unknown string "��tudes" for some reason I don't know.
I am sure that the array contains the right data though.

Code:

#include 
#include 
#include 

#define MAX 99171

// Prototype //
int binary_search(string*, string);

int main(int argc, string argv[])
{
    // Attributes // 
    string dictionary[MAX];
    FILE* dictionaryFile = fopen("words", "r");
    char output[256];
    string key = argv[1];

    // Check if their is a problem while reading the file //
    if (dictionaryFile == NULL)
    {
        // If everything got fouled up then close the file // 
        fclose(dictionaryFile);
        printf("couldn't read the file!!!
");
        return 1;
    }

    // storing the information into an array to make it easy to read //
    for(int i = 0; i < MAX; i++)
    { 
        fgets(output, sizeof(output), dictionaryFile); 
        dictionary[i] = output;
    }

    // Binary Search a word //
    if(binary_search(dictionary, key) == 1)
    {
        printf("word was found !!!
");
    }
    else if(binary_search == 0)
    {
        printf("word was not found !!!
");
    }

    // If Everything goes just fine close the file //
    fclose(dictionaryFile);
    return 0;
}


// implementing prototype //

/**
    @arag dictionary 
        a string of english words 

    @arg key 
        a key we looking for

    @return 
        0 if didn't find the key and 1 otherwise
*/
int binary_search(string* dictionary, string key)
{
    // pointer to the start and the end of the array //
    int start = 0;
    int end = MAX - 1;
    int mid;

    // while end is greater than the start //
    while (end > start)
    {
        // Get The Middle Element //
        mid = (start + end) / 2;
        printf("%s
", dictionary[mid]);

        // Check if the middle elemenet //
        if (strcmp(key, dictionary[mid]) == 0)
        {
            return 1;
        }

        // Check the left half //
        else if(strcmp(key, dictionary[mid]) < 0)
        {
            end = mid - 1;
        }

        // Check the right half //
        else if (strcmp(key, dictionary[mid]) > 0)
        {
            start = mid + 1;
        }
    }
    // didn't find the key //
    return 0;

}

Note: the cs50.h library is made by Harvard as a training wheel for beginners like me and I am using it in my code and this is a link to its reference.

M Oehm · Accepted Answer

The cs50.h library is made by Harvard as a training wheel for beginners.

If so, these training wheels are mounted upside down and don't touch the ground. I can't tell from your link, but I think that

typedef char *string;

is part of the cs50 suite. But there are no strings in C; the expression is used loosely to mean an array of chars that is terminated with a null character, '\0'.

The above definition of string has led you to believe that string is a proper type whose memory is automatically handled. It isn't. In your program there is place for one string, namely the array

char output[256];

The "strings" in your dictionary are just pointers; they are supposed to point to existing char arrays or to be NULL. By assigning

dictionary[i] = output;

you make all strings in your dictionary equal to the temporary buffer output. That buffer is overwritten in each line you read and will contain only the last line you've read, maybe "zulu".

You can confirm this by printing out the dictionary after you have read it. You should print it in a separate loop, not in the same loop where you read it to see the effect.

You can fix this by declaring your array of pointers as an array of arrays of char:

char dictionary[MAX][LEN];

where LEN is a suitable maximum length for words, say 24. (The problem here might be that the allocated memory, MAX * LEN bytes may not fit onto the stack. In that case, you must allocate the memory on the heap with malloc. I'm not going to open that can of worms here. If you get a segmentation violation immediately, try reducing MAX at the cost of reading only a part of the dictionary.)

When you read the words, you must copy the contents:

fgets(output, sizeof(output), dictionaryFile); 
strncpy(dictionary[i], output, sizeof(dictionary[i]);

or, better yet, read the next word directly into the dictionary:

fgets(dictionary[i], sizeof(dictionary[i]), dictionaryFile);

Unfortunately, fgets retains the newline at the end, so that it reads "word " instead of "word". You must remove the newline or the strings won't match the input, which comes from the command line via argv, which has no trailing newlines.

There are several ways to chomp off the unwanted newline. An easy one is to tokenise the string with a newline as separator:

strtok(dictionary[i], "
");

Another problem is that with the new definition of dictionary, your signature for binary_search is wrong. You don't have an array of pointers to char anymore, you have an array of arrays of 24 (or so, a fixed number anyways) chars. Change it to:

int binary_search(char dictionary[][LEN], const char *key)

In C, if you have arrays of arrays (of arrays, even), all but the topmost dimension must be known, so that the compiler can lay out the memory.

There are other (rather minor) problems:

You try to fclose the file if it couldn't be opened. Bu when the file is NULL, you don't have an open file to close; just exit.
You should enforce that there is at least one argument, otherwise you might loop up a null key, which will lead to undefined behaviour (that is, very likely a crash) when you try to compare it.
When you read in the words, don't rely on a hard-coded word count. You don't know how many words are in the file. Check the return value of fgets; it returns NULL when the file has run out. MAX is a good way to estimate the number of words, but you should keep the actual number of words read in a variable. Make sure that you don't access more words than you have read and make sure that you don't write beyond the memory that you have allocated, i.e. don't read more than MAX words.
When you don't have a hard-coded word count, you should make that count a parameter of your binary_search function.
In the "not found" beanch, your test is else if(binary_search == 0). First, else aleady means that the binary search didn't return 1 (which was the condition the else refers to) and the binary search can return only 0 and 1, so there's no need for another condition. Second, binary_search is only the address of the function, not a result; the consition as written above will always be true.
The same goes for the strcmp calls in the binary search function: You do three comparisons. The outcomes you check are mutually exclusive, so the last condition can be just else. (Because strcmp does a character-by-character comparison every time, it might be worth calling strcmp just once per word and store the result.)

The string data type in the cs50 header is meant to provide an easy means to read in strings without having to care about the memory. Once you start creating more complex (aka real-life) data structures, it is better to use char arrays and pointers. There's no way around this anyway and you get to see what each piece of data is.

I'm sorry that my answer looks like a laundry list of errors. C's string handling is not very easy for beginners, especially not if you have already experience in higher-level languages. The good thing is that when you understand C strings, you already know a lot about how things are done in C in general.

Binary searching a dictionary of words

Answers (1)

Related Questions