kristian williams

Reputation: 21

Trouble with a text reading and analysing program

Hope everyone is well. Any help would be greatly appreciated :)

I'm trying to write a C program that asks the user for the names of 2 separate txt files.

The program should then ignore specific characters when scanning the txt files, using the header file exclude.h.

The first txt file contains the main text; the second contains keywords that need to be compared against the first txt file.

The code needs to determine how many times each word in the 2nd txt file appears in the 1st txt file. The comparison needs to be case sensitive.

The frequency of each word within the keywords file should be stored within separate arrays.

Finally the name of each word and the number of occurrences should be printed, one at a time, in descending order.

the 1st txt file: https://docs.google.com/document/d/11MXPUHthb-gplv0w6WNz7k7ZAVfDwVNl5TcCiRyXt6k/edit?usp=sharing

the 2nd txt file: https://docs.google.com/document/d/1QsPRfGygXyq5Pr5_9oo4H204-Y3SdUad7GA5vY4Lsu4/edit?usp=sharing

Here's my code:

The header file (exclude.h):

#ifndef EXCLUDE_H
#define EXCLUDE_H

// Define punctuation characters to be excluded
#define EXCLUDE_CHARS "!\"$(),-./:;?`"

#endif // EXCLUDE_H

The purpose of the header file is to stop the program from reading punctuation marks in the texts, as they may prevent it from matching a word that sits next to a punctuation mark.

The main code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "exclude.h"

#define MAX_WORD_LENGTH 100
#define MAX_WORDS 1000

void toLower(char *str) {
    for (int i = 0; str[i]; i++) {
        str[i] = tolower(str[i]);
    }
}

int isExcluded(char ch) {
    for (int i = 0; EXCLUDE_CHARS[i] != '\0'; i++) {
        if (EXCLUDE_CHARS[i] == ch) {
            return 1; // Character is excluded
        }
    }
    return 0; // Character is not excluded
}

int main() {
    char filename1[100], filename2[100];
    printf("Enter the name of the main text file: ");
    scanf("%s", filename1);
    printf("Enter the name of the keywords file: ");
    scanf("%s", filename2);

    FILE *file1 = fopen(filename1, "r");
    FILE *file2 = fopen(filename2, "r");

    if (file1 == NULL || file2 == NULL) {
        perror("Error opening files");
        return EXIT_FAILURE;
    }

    char word[MAX_WORD_LENGTH];
    char keywords[MAX_WORDS][MAX_WORD_LENGTH];
    int frequency[MAX_WORDS] = {0};
    int numKeywords = 0;

    // Read keywords from file2
    while (fscanf(file2, "%s", word) == 1) {
        toLower(word);
        strcpy(keywords[numKeywords], word);
        numKeywords++;
    }

    // Compare keywords with file1
    while (fscanf(file1, "%s", word) == 1) {
        toLower(word);

        // Remove excluded characters
        int len = strlen(word);
        int j = 0;
        for (int i = 0; i < len; i++) {
            if (!isExcluded(word[i])) {
                word[j++] = word[i];
            }
        }
        word[j] = '\0';

        // Check if the word is a keyword
        for (int i = 0; i < numKeywords; i++) {
            if (strcmp(word, keywords[i]) == 0) {
                frequency[i]++;
                break;
            }
        }
    }

    // Print word frequencies in descending order
    for (int i = 0; i < numKeywords; i++) {
        int maxIndex = 0;
        for (int j = 1; j < numKeywords; j++) {
            if (frequency[j] > frequency[maxIndex]) {
                maxIndex = j;
            }
        }
        if (frequency[maxIndex] > 0) {
            printf("%s: %d\n", keywords[maxIndex], frequency[maxIndex]);
            frequency[maxIndex] = 0; // Mark as printed
        }
    }

    fclose(file1);
    fclose(file2);

    return 0;
}

The output should look like this:

word: frequency:

word: frequency:

word: frequency:

word: frequency:

word: frequency:

word: frequency:

word: frequency:

word: frequency:

word: frequency:

word: frequency:

with the word with the highest frequency at the top and the rest displayed in descending order.

I'm unsure why it isn't working. Thanks again for any advice/help :)

Upvotes: 0

Views: 46

Answers (1)

David C. Rankin

Reputation: 84569

You are thinking along the correct lines, but there are a few things that can make life much easier. You first want to read file2 and store the words in a struct that holds a word and a count for the frequency. This makes things easier from a storage standpoint: you simply need an array-of-struct that holds each word from file2 with its counter (frequency) initialized to zero to begin with.

You then loop over the words in file1. Since you have to split each word on the collection of EXCLUDE_CHARS, the strtok() function makes removing those characters and processing each word (or part of a word) on either side of the punctuation easy (see footnote 1). strtok() takes EXCLUDE_CHARS as the delimiters between words (tokens), and you simply loop, calling strtok() on the whitespace-separated string read from file1, to process each word with the delimiters removed.

strtok() will treat any sequence of the characters in EXCLUDE_CHARS as a single delimiter, removing all at once. To use strtok(), note that you call it with the string as the first parameter on the first call and replace it with NULL for all subsequent calls. strtok() does modify the input, so make a copy if you need to preserve the input string.

With your words from file2 read into your collection of struct (in wf as in the answer to your last question), your read-loop for file1 is reduced to:

int main (int argc, char **argv) {

  char buf[MAXWORDCHAR];    /* buffer to hold each word in file */
  wordfreq *wf = NULL;      /* pointer to allocated collection of struct */
  size_t  nwords = 0;       /* no. of words in file2 */
  /* use filenames provided as 1st & 2nd arguments */
  FILE *fp = NULL;
  ...
  /* loop reading each word in file1 */
  while (fscanf (fp, "%1023s", buf) == 1) {
    /* split buf on EXCLUDE_CHARS processing each part as a word */
    for (char *p = strtok (buf, EXCLUDE_CHARS); 
          p; 
          p = strtok (NULL, EXCLUDE_CHARS)) {
      
      /* check if word in struct */
      int index = wordexists (p, wf, nwords);
      if (wf && index >= 0) {       /* if word exists */
        wf[index].freq += 1;        /* increment freq for word */
      }
    }
  }
  fclose (fp);              /* close file1 */

Note, if a while() loop makes using strtok() easier for you to read than putting it all together in a for() loop, you can do that with:

  ...
    char *p = strtok (buf, EXCLUDE_CHARS);      /* call with string 1st call */
  
    while (p != NULL) {
      /* check if word in struct */
      int index = wordexists (p, wf, nwords);
      if (wf && index >= 0) {       /* if word exists */
        wf[index].freq += 1;        /* increment freq for word */
      }
    
      p = strtok (NULL, EXCLUDE_CHARS);         /* call with NULL thereafter */
    }

Now all you have to sort out is how you want to provide storage for your word frequency structs and the strings you read from file2. While you can use a fixed array size to hold each of the words, that tends to require quite a bit more storage than you actually use. In your last question, my understanding was that you were storing all the unique words in file1 (some 54,300+ words), and reserving 1000 characters for each word, when a word averages about 6 characters, adds up.

Alternatively, you can allocate only the bytes you need to store each word (+1 for the nul-terminating character). It is a few more lines of code, but is an alternative to consider.

Putting all the pieces together (with the descending sort discussed in your last question, allocating for the structs and for each word read from file2, and moving the code that scans file1 for the frequency of words from file2 into a function find_word_frequencies() to tidy up main()), you can do something similar to the following:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXWORDCHAR     1024
#define EXCLUDE_CHARS   "!\"$(),-./:;?`"

typedef struct {
  char *word;
  size_t freq;
} wordfreq;


/* qsort compare function for sorting keyword frequencies
 * using (a < b) - (a > b) for descending sort to avoid potential overflow.
 * for ascending sort use: (a > b) - (a < b)
 */
int cmpwordfreq (const void *a, const void *b) {
  const wordfreq  *x = a,
                  *y = b;

  /* sort alphabetically if frequencies are equal */
  if (x->freq == y->freq) {
    return (strcmp (x->word, y->word));
  }

  /* sort by frequency */
  return (x->freq < y->freq) - (x->freq > y->freq);
}


/* check if word exists in allocated wordfreq structs
 * returns index of word if it exists, -1 otherwise.
 */
int wordexists (const char *word, const wordfreq *wf, size_t n)
{
  /* validate both word and wf not NULL */
  if (!word || !wf) {
    return -1;    /* no word in struct */
  }

  /* loop over each struct comparing word stored with word */
  for (size_t i = 0; i < n; i++) {
    if (strcmp (word, wf[i].word) == 0) {
      return i;   /* return index if found */
    }
  }

  return -1;      /* no word in struct */
}


/* read file into allocated collection of struct wordfreq,
 * takes pointer to size_t as 2nd parameter to make number of words 
 * available to caller. Returns address of allocated collection on success,
 * NULL otherwise. 
 */
wordfreq *read_file2_words (FILE *fp, size_t *nwords)
{
  char buf[MAXWORDCHAR];    /* buffer to hold each word in file */
  wordfreq *wf = NULL;      /* pointer to allocated collection of struct */
  size_t  allocated = 0,    /* no. of pointers allocated */
          used = 0;         /* no. of pointers used */
  
  /* loop reading each word in file */
  while (fscanf (fp, "%1023s", buf) == 1) {
    size_t len = 0;         /* length of buf */
    
    /* if more pointers needed, realloc more */
    if (used == allocated) {
      /* always realloc to a temporary pointer */
      void *tmp = realloc (wf, (allocated ? 2 * allocated : 2) * sizeof *wf);
      /* validate realloc return */
      if (!tmp) {                       /* if realloc failed */
        perror ("realloc-p-failed");    /* output error */
        if (wf == NULL) {               /* check if 1st allocation */
          return NULL;                  /* return failure */
        }
        break;                          /* break read loop */
      }
      wf = tmp;                                   /* assign realloced block */
      allocated = allocated ? allocated * 2 : 2;  /* update no. allocated */
    }
    
    /* allocate storage for word */
    len = strlen (buf);                         /* get length of buf */
    wf[used].word = malloc (len + 1);           /* allocate len + 1 bytes */
    if (wf[used].word == NULL) {                /* validate */
      perror ("malloc-p[used]");
      break;
    }
    
    memcpy (wf[used].word, buf, len + 1);       /* copy buf to struct */
    wf[used++].freq = 0;                        /* initialize freq 0 */
  }
  
  /* (**optional**) final realloc to exact no. of struct,
   * skipped if nothing was stored since realloc to size 0 is not portable
   */
  if (used > 0) {
    void *tmp = realloc (wf, used * sizeof *wf);
    if (tmp) {      /* if realloc succeeded, update wf */
      wf = tmp;
    }
  }
  
  *nwords = used;   /* update value at nwords to no. of pointers used */
  
  return wf;        /* return allocated collection of pointers & strings */
}


/* read file1 and determine frequency of each word contained in 
 * collection of word-frequency structs filled from file2. Returns
 * non-zero on success, zero on failure.
 */
int find_word_frequencies (wordfreq *wf, size_t nwords, FILE *fp)
{
  char buf[MAXWORDCHAR];    /* buffer to hold each word in file */
  
  /* loop reading each word in file1 */
  while (fscanf (fp, "%1023s", buf) == 1) {
    /* split buf on EXCLUDE_CHARS processing each part as a word */
    for (char *p = strtok (buf, EXCLUDE_CHARS); 
          p; 
          p = strtok (NULL, EXCLUDE_CHARS)) {
      
      /* check if word in struct */
      int index = wordexists (p, wf, nwords);
      if (wf && index >= 0) {       /* if word exists */
        wf[index].freq += 1;        /* increment freq for word */
      }
    }
  }
  
  return feof (fp);     /* non-zero if EOF reached, 0 otherwise */
}


/* free allocated strings and structs */
void free_wf (wordfreq *wf, size_t n)
{
  /* loop over each struct freeing string */
  for (size_t i = 0; i < n; i++) {
    free (wf[i].word);
  }
  
  free (wf);    /* free allocated structs */
}


int main (int argc, char **argv) {

  wordfreq *wf = NULL;      /* pointer to allocated collection of struct */
  size_t  nwords = 0;       /* no. of words in file2 */
  /* use filenames provided as 1st & 2nd arguments */
  FILE *fp = NULL;
  
  if (argc < 3) { /* validate 2 arguments given */
    puts ("error: insufficient number of arguments provided\n"
          "usage: ./program file1 file2");
    return 1;
  }
  
  fp = fopen (argv[2], "r");      /* open file2 first */
  if (!fp) {                      /* validate open for reading */
    perror ("fopen-file2");
    return 1;
  }
  
  /* read file2 words, allocating for each struct and string */
  wf = read_file2_words (fp, &nwords);
  if (wf == NULL || nwords == 0) {
    puts ("read of file2 words failed");
    return 1;
  }
  fclose (fp);              /* close file2 */
  
  /* open file1 and validate it is open for reading */
  if ((fp = fopen (argv[1], "r")) == NULL) {
    perror ("fopen-file1");
    return 1;
  }
  
  /* find frequency of each word in file2 in file1 */
  if (find_word_frequencies (wf, nwords, fp) == 0) {
    fputs ("error: file1 EOF not reached.\n", stdout);
    return 1;
  }
  fclose (fp);              /* close file1 */

  /* sort descending by frequency
   * then alphabetically by name if frequencies equal
   */
  qsort (wf, nwords, sizeof *wf, cmpwordfreq);

  /* Print the results in a table format */
  printf ("Unique words: %zu\n\n"
          "%-20s Frequency\n", nwords, "Keywords");
  for (size_t i = 0; i < nwords; i++) {
    printf ("%-20s %zu\n", wf[i].word, wf[i].freq);
  }

  free_wf (wf, nwords);   /* don't forget to free what you allocated */
}

Example Use/Output

With Hamlet in file1 and your words to categorize by frequency in file2 and providing file1 and file2 as the first two arguments to the program, you would get:

$ ./program file1 file2
Unique words: 10

Keywords             Frequency
common               8
euery                8
gaue                 7
Thankes              5
Vnkle                4
Day                  3
Soft                 3
growes               3
wag                  2
seal                 1

(you can remove the unique-word count -- up to you)

Also, as discussed in your last question, always use a memory checker to verify there are no memory errors and you have freed all memory you have allocated, e.g.

$ valgrind ./program file1 file2
==17959== Memcheck, a memory error detector
==17959== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==17959== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==17959== Command: ./program file1 file2
==17959==
Unique words: 10

Keywords             Frequency
common               8
euery                8
gaue                 7
Thankes              5
Vnkle                4
Day                  3
Soft                 3
growes               3
wag                  2
seal                 1
==17959==
==17959== HEAP SUMMARY:
==17959==     in use at exit: 0 bytes in 0 blocks
==17959==   total heap usage: 20 allocs, 20 frees, 10,857 bytes allocated
==17959==
==17959== All heap blocks were freed -- no leaks are possible
==17959==
==17959== For lists of detected and suppressed errors, rerun with: -s
==17959== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

There are other ways to approach the problem, but "tokenizing" the words read from file1 based on EXCLUDE_CHARS is about as straightforward as any. Let me know if you have questions.

footnotes:

  1. You still need to determine what rule you will use for hyphenated words, but since file2 contains no hyphenated words, that isn't an issue this time.

Upvotes: 0
