Nate

Reputation: 65

Count the number of occurrences of each word

I'm trying to count the number of occurrences of each word in the function countWords. I believe I started the for loop in the function properly, but how do I compare the words in the array with each other, count them, and then delete the duplicates? Isn't it like a Fibonacci series, or am I mistaken? Also, int n has the value 756 because that's how many words are in the array, and wordArray holds the elements of the array.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

int *countWords( char **words, int n);
int main(int argc, char *argv[])
{
  char buffer[100];  //Maximum word size is 100 letters
  FILE *textFile;
  int numWords=0;
  int nextWord;
  int i, j, len, lastChar;
  char  *wordPtr;
  char  **wordArray;
  int *countArray;
  int *alphaCountArray;
  char **alphaWordArray;
  int *freqCountArray;
  char **freqWordArray;
  int choice=0;

  //Check to see if command line argument (file name)
  //was properly supplied.  If not, terminate program
  if(argc == 1)
  {
    printf ("Must supply a file name as command line argument\n");
    return (0);
  }

  //Open the input file.  Terminate program if open fails
  textFile=fopen(argv[1], "r");
  if(textFile == NULL)
  {
    printf("Error opening file. Program terminated.\n");
    return (0);
  }

  //Read file to count the number of words
  fscanf(textFile, "%s", buffer);
  while(!feof(textFile))
  {
    numWords++;
    fscanf(textFile, "%s", buffer);
  }

  printf("The total number of words is: %d\n", numWords);
  //Create array to hold pointers to words
  wordArray = (char **) malloc(numWords*sizeof(char *));
  if (wordArray == NULL)
  {
     printf("malloc of word Array failed.  Terminating program.\n");
     return (0);
  }
  //Rewind file pointer and read file again to create
  //wordArray
  rewind(textFile);
  for(nextWord=0; nextWord < numWords; nextWord++)
  {
    //read next word from file into buffer.
    fscanf(textFile, "%s", buffer);

    //Remove any punctuation at beginning of word
    i=0;
    while(!isalpha(buffer[i]))
    {
      i++;
    }
    if(i>0)
    {
      len = strlen(buffer);
      for(j=i; j<=len; j++)
      {
        buffer[j-i] = buffer[j];
      }
    }

    //Remove any punctuation at end of word
    len  = strlen(buffer);
    lastChar = len -1;
    while(!isalpha(buffer[lastChar]))
    {
      lastChar--;
    }
    buffer[lastChar+1] = '\0';

    //make sure all characters are lower case
    for(i=0; i < strlen(buffer); i++)
    {
      buffer[i] = tolower(buffer[i]);
    }

    //Now add the word to the wordArray.
    //Need to malloc an array of chars to hold the word.
    //Then copy the word from buffer into this array.
    //Place pointer to array holding the word into next
    //position of wordArray
    wordPtr = (char *) malloc((strlen(buffer)+1)*sizeof(char));
    if(wordPtr == NULL)
    {
      printf("malloc failure.  Terminating program\n");
      return (0);
    }
    strcpy(wordPtr, buffer);
    wordArray[nextWord] = wordPtr;
  }

  //Call countWords() to create countArray and replace
  //duplicate words in wordArray with NULL
  countArray = countWords(wordArray, numWords);
  if(countArray == NULL)
  {
    printf("countWords() function returned NULL; Terminating program\n");
    return (0);
  }

  //Now call compress to remove NULL entries from wordArray
  compress(&wordArray, &countArray, &numWords);
  if(wordArray == NULL)
  {
    printf("compress() function failed; Terminating program.\n");
    return(0);
  }
  printf("Number of words in wordArray after eliminating duplicates and compressing is: %d\n", numWords);

  //Create copy of compressed countArray and wordArray and then sort them alphabetically
  alphaCountArray = copyCountArray(countArray, numWords);
  freqCountArray = copyCountArray(alphaCountArray, numWords);
int *countWords( char **wordArray, int n)
{
  return NULL;
  int i=0;
  int n=0;

  for(i=0;i<n;i++)
  {
      for(n=0;n<wordArray[i];n++)
      {

      }
   }

}

Upvotes: 4

Views: 799

Answers (2)

Craig Estey

Reputation: 33631

I'm going to throw you a bit of a curve ball here.

Rather than fix your code, which is pretty good on its own but incomplete and could easily be fixed, I decided to write an example from scratch.

There's no need to read the file twice [the first time just to get the word count]. A dynamic array grown with realloc could handle that.
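As a rough sketch of the single-pass idea, assuming words are whitespace-separated and at most 99 characters long (the question's buffer size): the array of word pointers starts empty and doubles whenever it fills up. The name read_words and the starting capacity of 16 are my own choices for illustration, not anything from the question.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: read whitespace-separated words from fp into a heap
// array that grows on demand. Stores the number of words in *count.
// Returns NULL on allocation failure, or when the stream holds no words
// (in which case *count is 0).
char **read_words(FILE *fp, size_t *count)
{
    char buffer[100];           // same word-size limit as the question
    char **words = NULL;
    size_t used = 0;            // slots filled so far
    size_t cap = 0;             // slots currently allocated

    *count = 0;

    // %99s leaves room for the terminating '\0' in buffer
    while (fscanf(fp, "%99s", buffer) == 1) {
        if (used == cap) {
            // out of room: double the capacity (16 is an arbitrary start)
            size_t newcap = (cap == 0) ? 16 : cap * 2;
            char **tmp = realloc(words, newcap * sizeof(*tmp));
            if (tmp == NULL)
                goto fail;
            words = tmp;
            cap = newcap;
        }

        // store a private copy of the word
        words[used] = malloc(strlen(buffer) + 1);
        if (words[used] == NULL)
            goto fail;
        strcpy(words[used], buffer);
        used++;
    }

    *count = used;
    return words;

fail:
    // release everything stored so far
    while (used > 0)
        free(words[--used]);
    free(words);
    return NULL;
}
```

Assigning the result of realloc to a temporary first means the old array is not leaked if the call fails.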

The main point, I guess, is that it is much easier to ensure that the word list has no duplicates while creating it, rather than removing duplicates at the end.

I opted for a few things.

I created a "word control" struct. You've got several separate arrays that are indexed the same way. That, sort of, "cries out" for a struct. That is, rather than [say] 5 separate arrays, have a single array of a struct that has 5 elements in it.
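To make the struct argument a bit more concrete, here is a small sketch of what it buys you when sorting, which is what the alphaCountArray/freqCountArray pairs in the question seem to be heading toward. With one array of structs, each word and its count move together under qsort, so there is nothing to keep in sync. The names word_entry, by_text, and by_count_desc are made up for this illustration.

```c
#include <stdlib.h>
#include <string.h>

// One record per unique word, instead of parallel word/count arrays
struct word_entry {
    const char *text;           // the word
    int count;                  // how many times it occurred
};

// qsort comparator: alphabetical order by word text
static int by_text(const void *a, const void *b)
{
    const struct word_entry *x = a;
    const struct word_entry *y = b;

    return strcmp(x->text, y->text);
}

// qsort comparator: highest count first
static int by_count_desc(const void *a, const void *b)
{
    const struct word_entry *x = a;
    const struct word_entry *y = b;

    // yields -1, 0, or 1 without risking overflow from subtraction
    return (y->count > x->count) - (y->count < x->count);
}
```

Usage would be something like qsort(entries, n, sizeof(entries[0]), by_text) for the alphabetical listing, and the same call with by_count_desc for the frequency listing.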

The word list is a linked list of these structs. It could be a dynamic array on the heap that gets realloced instead, but the linked list is actually easier to maintain for this particular usage.

Each struct has the [cleaned up] word text and a count of the occurrences (vs. your separate wordArray and countArray).

When adding a word, the list is scanned for an existing match. If one is found, the count is incremented, rather than creating a new word list element. That's the key to eliminating duplicates [i.e. don't create them in the first place].

Anyway, here it is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <errno.h>

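// NOTE: uses standard C99 variadic macro syntax so it also builds on
// compilers without the GNU named-variadic extension
#define sysfault(...) \
    do { \
        printf(__VA_ARGS__); \
        exit(1); \
    } while (0)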

// word control
typedef struct word {
    struct word *next;              // linked list pointer
    char *str;                      // pointer to word string
    int count;                      // word frequency count
} word_t;

word_t wordlist;                    // list of words

// cleanword -- strip chaff and clean up word
void
cleanword(char *dst,const char *src)
{
    int chr;

    // NOTE: using _two_ buffers is much easier than trying to clean one
    // buffer in-place
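    // NOTE: cast to unsigned char because isalpha/tolower are undefined for
    // negative values (plain char may be signed)
    for (chr = (unsigned char) *src++;  chr != 0;  chr = (unsigned char) *src++) {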
        if (! isalpha(chr))
            continue;
        chr = tolower(chr);
        *dst++ = chr;
    }

    *dst = 0;
}

// addword -- add unique word to list and keep count of number of words
void
addword(const char *str)
{
    word_t *cur;
    word_t *prev;
    char word[1000];

    // get the cleaned up word
    cleanword(word,str);

    // nothing left after cleaning (e.g. "--" or "123") -- not a word
    if (word[0] == 0)
        return;

    // find a match to a previous word [if it exists]
    prev = NULL;
    for (cur = wordlist.next;  cur != NULL;  cur = cur->next) {
        if (strcmp(cur->str,word) == 0)
            break;
        prev = cur;
    }

    // found a match -- just increment the count (i.e. do _not_ create a
    // duplicate that has to be removed later)
    if (cur != NULL) {
        cur->count += 1;
        return;
    }

    // new unique word
    cur = malloc(sizeof(word_t));
    if (cur == NULL)
        sysfault("addword: malloc failure -- %s\n",strerror(errno));

    cur->count = 1;
    cur->next = NULL;

    // save off the word string
    cur->str = strdup(word);
    if (cur->str == NULL)
        sysfault("addword: strdup failure -- %s\n",strerror(errno));

    // add the new word to the end of the list
    if (prev != NULL)
        prev->next = cur;

    // add the first word
    else
        wordlist.next = cur;
}

int
main(int argc,char **argv)
{
    FILE *xf;
    char buf[1000];
    char *cp;
    char *bp;
    word_t *cur;

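    --argc;
    ++argv;

    // no file name given -- fopen(NULL,...) would be undefined behavior
    if (argc < 1)
        sysfault("main: must supply a file name as command line argument\n");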

    xf = fopen(*argv,"r");
    if (xf == NULL)
        sysfault("main: unable to open '%s' -- %s\n",*argv,strerror(errno));

    while (1) {
        // get next line
        cp = fgets(buf,sizeof(buf),xf);
        if (cp == NULL)
            break;

        // loop through all words on a line
        bp = buf;
        while (1) {
            cp = strtok(bp," \t\n");
            bp = NULL;

            if (cp == NULL)
                break;

            // add this word to the list [avoiding duplicates]
            addword(cp);
        }
    }

    fclose(xf);

    // print the words and their counts
    for (cur = wordlist.next;  cur != NULL;  cur = cur->next)
        printf("%s %d\n",cur->str,cur->count);

    return 0;
}

Upvotes: 1

The Dark

Reputation: 8514

Assuming you want the return value of countWords to be an array of integers with word counts for each unique word, you need to have a double loop. One loop goes over the whole array, the second loop goes through the rest of the array (after the current word), looking for duplicates.

You could do something like this pseudo code:

Allocate the return array countArray (n integers) 
Loop over all words (as you currently do in your `for i` loop)
   If the word at `i` is not null // Check we haven't already deleted this word
      // Found a new word
      Set countArray[i] to 1
      Loop through the rest of the words e.g. for (j = i + 1; j < n; j++)
         If the word at j is not NULL and matches the word at i (using strcmp)
            // Found a duplicate word
            Increment countArray[i] (the original word's count)
            // We don't want wordArray[j] anymore, so 
            Free wordArray[j]
            Set wordArray[j] to NULL
   Else
      // A null indicates this was a duplicate, set the count to 0 for consistency.
      Set countArray[i] to 0
Return countArray
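In C, the pseudocode could look roughly like the following. This is a sketch that assumes every non-NULL entry in the array was allocated with malloc (as in the question's main), so duplicates can be passed to free. The count for each word ends up at the index of its first occurrence, and the function returns the count array.

```c
#include <stdlib.h>
#include <string.h>

// Count occurrences of each word in words[0..n-1].
// Duplicates are freed and replaced with NULL; the count for a word is
// stored at the index of its first occurrence, and 0 is stored at the
// index of each removed duplicate.
// Returns a malloc'd array of n counts, or NULL if n <= 0 or malloc fails.
int *countWords(char **words, int n)
{
    int *countArray;
    int i, j;

    if (n <= 0)
        return NULL;

    countArray = malloc(n * sizeof(*countArray));
    if (countArray == NULL)
        return NULL;

    for (i = 0; i < n; i++) {
        // already removed as a duplicate of an earlier word
        if (words[i] == NULL) {
            countArray[i] = 0;
            continue;
        }

        // first occurrence of this word
        countArray[i] = 1;

        // scan the rest of the array for duplicates
        for (j = i + 1; j < n; j++) {
            if (words[j] != NULL && strcmp(words[i], words[j]) == 0) {
                countArray[i]++;
                free(words[j]);
                words[j] = NULL;
            }
        }
    }

    return countArray;
}
```

This is O(n^2) string comparisons, which is fine for a few hundred words like the 756 in the question.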

Upvotes: 1
