kristian williams
kristian williams

Reputation: 21

problem with the frequencies output in c program which should compare txt file to keywords and output frequency

hope everyone is well! im a bit stuck on my programming assignment, the purpose of the c code is to compare a txt file (name given as user input) to another keywords txt file. then record the frequency of each word and print them in descending order. This should look like this:

euery 8
common 8
gaue 7
thankes 5
vnkle 4
growes 3
wag 3
seal 3
day 3
soft 3

sadly this isnt what my code produces. the code is as shown below:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FILENAME_LENGTH 256
#define MAX_TEXT_LENGTH 100000
#define MAX_KEYWORDS 100

// Structure to store keyword frequencies
typedef struct KeywordFrequency {
    char keyword[50];
    int frequency;
} KeywordFrequency;

// Compare function for sorting keyword frequencies
int compareKeywords(const void *a, const void *b) {
    return ((KeywordFrequency *)b)->frequency - ((KeywordFrequency *)a)->frequency;
}

int main() {
    char filename[MAX_FILENAME_LENGTH];
    char text[MAX_TEXT_LENGTH];
    KeywordFrequency keywords[MAX_KEYWORDS];
    int totalKeywords = 0;

    // Ask the user for the name of the text file
    printf("Enter the name of the text file: ");
    scanf("%s", filename);

    // Open and read the text file
    FILE *textFile = fopen(filename, "r");
    if (textFile == NULL) {
        printf("Error opening %s\n", filename);
        return 1;
    }

    // Read the content of the text file
    fread(text, 1, sizeof(text), textFile);
    fclose(textFile);

    // Open and read the keyword file
    FILE *keywordFile = fopen("keyword.txt", "r");
    if (keywordFile == NULL) {
        printf("Error opening keyword.txt\n");
        return 1;
    }

    // Initialize keyword frequencies
    while (fscanf(keywordFile, "%s", keywords[totalKeywords].keyword) != EOF) {
        keywords[totalKeywords].frequency = 0;
        totalKeywords++;
    }
    fclose(keywordFile);

    // Tokenize and compare the text with keywords
    char *token = strtok(text, " ");
    while (token != NULL) {
        for (int i = 0; i < totalKeywords; i++) {
            if (strcmp(token, keywords[i].keyword) == 0) {
                keywords[i].frequency++;
            }
        }
        token = strtok(NULL, " ");
    }

    // Sort keyword frequencies in descending order
    qsort(keywords, totalKeywords, sizeof(KeywordFrequency), compareKeywords);

    // Print the results in a table format
    printf("Keyword\tFrequency\n");
    for (int i = 0; i < totalKeywords; i++) {
        printf("%s\t%d\n", keywords[i].keyword, keywords[i].frequency);
    }

    return 0;
}

enter image description here

any help would be greatly appreciated

Upvotes: 2

Views: 151

Answers (2)

David C. Rankin
David C. Rankin

Reputation: 84569

As @someprogrammerdude points out in his answer, using "%s" will only read a single whitespace separated word. In order to read an entire line of text into a sufficiently sized buffer, fgets() is the preferred function. (in fact it is the preferred manner of reading text or user-input in just about all cases). You then parse (separate) the values you need from the buffer filled by fgets() with sscanf(). This provides several benefits over using fscanf() directly. It decouples the read and parse and allows separate validation of each step.

Continuing from my comments, don't use Magic-Numbers. 50 in your struct definition is a Magic-Number. You do a good job declaring constants with #define, just define a constant for the keyword member array as well. You also do a great job Not Skimping On Buffer Size with MAX_TEXT_LENGTH, but you can tailor that a bit more narrowly for the file at hand. Generally lines are no longer than 1K or so. 100K certainly covers that, but on embedded or memory limited systems, that may be a bit much for a read buffer. For the lines here, 100 bytes is fine. (that's still 10X the longest line) So the following is fine:

#define MAX_FILENAME_LENGTH 256
// #define MAX_TEXT_LENGTH 100000
#define MAX_KEYWORDS 100
#define MAX_LINE MAX_KEYWORDS
#define MAX_CHARS 50

(note: limits.h provides the PATH_MAX define giving the maximum pathname length for the system which is a convenient way to size your filename array)

Now, to eliminate your Magic-Number, simply define your KeywordFrequency struct as:

/* Structure to store keyword frequencies */
typedef struct KeywordFrequency {
  char keyword[MAX_CHARS];
  int frequency;
} KeywordFrequency;

While you variable declarations are fine, it is a good habit to initialize all variables and arrays (especially when learning C) to prevent an inadvertent access of a variable with automatic storage duration while its value is indeterminate (a/k/a Undefined Behavor) If possible, by defining your variables at the start of the scope you ensure your code is portable to all C language standards. With that, a slight rearranging could be:

  char  filename[MAX_FILENAME_LENGTH] = "", /* initialize empty-string */
        text[MAX_LINE] = "";
  int totalKeywords = 0;
  FILE *textFile = NULL;
  /* initialize array of struct all zero */
  KeywordFrequency keywords[MAX_KEYWORDS] = {{.keyword = ""}};

All users input should be taken with fgets() to ensure an entire line at a time is read (including the '\n' generated by the user pressing the Enter key). That way, there is never any question whether additional character remain in your input stream (e.g. stdin) unread -- just waiting to bite you on your next read attempt. To remove the '\n' read and included in the buffer filled by fgets(), simply use strcspn() as shown below. Your input can be taken as:

  /* prompt for name of the text file */
  fputs ("Enter the name of the text file: ", stdout);
  /* validate every input */
  if (!fgets (filename, MAX_FILENAME_LENGTH, stdin)) {
    /* how do you know this occurred? */
    puts ("user canceled input with manual EOF");
    return 0;
  }
  
  /* trim '\n' from end of filename overwriting with nul-character */
  filename[strcspn (filename, "\n")] = 0;

Now open your file and enter you read-loop reading the contents of your file into your array of struct KeywordFrequency. Note: you always want to condition your read-loop on the return of your read-function. Here you can do that as:

  /* Open and read the text file */
  if (!(textFile = fopen(filename, "r"))) {
    perror ("textfile");
    return 1;
  }
  
  /* condition your read loop on your read function return */
  while (fgets (text, MAX_LINE, textFile)) {
    /* validate parse of information */
    if (sscanf (text, "%49s %d", keywords[totalKeywords].keyword,
                                 &keywords[totalKeywords].frequency) != 2) {
      /* handle invalid formatted line */
      fprintf (stderr, "error: bad format for index %d\n", totalKeywords);
      /* skip to next line in file */
      continue;
    }
    
    totalKeywords += 1;     /* increment index */
  }

(note: the format-width modifier used with "%49s" to protect the bounds of the keyword array. You cannot use a defined constant in the format-string, so you just have to manually enter length-1 to ensure space for the nul-terminating character. And, no, this isn't considered a Magic-Number, it's just the way C works. But it is a Magic-Number in the sense it will save you from UB and a potential buffer overrun exploit...)

All that remains is sorting with qsort(). Per my comment, you want to rearrange your compare() function so it returns the result of a conditional expression rather than a subtraction. Why? You can easily overflow subtracting two large negative numbers. Using a comparison avoids that potential, e.g.

/* qsort compare function for sorting keyword frequencies
 * using (a < b) - (a > b) for descending sort to avoid potential overflow.
 * for ascending sort use: (a > b) - (a < b)
 */
int compareKeywords(const void *a, const void *b) {
  const KeywordFrequency  *x = a,
                          *y = b;
  
  return (x->frequency < y->frequency) - (x->frequency > y->frequency);
}
...
  /* Sort keyword frequencies in descending order */
  qsort (keywords, totalKeywords, sizeof *keywords, compareKeywords);

  /* Print the results in a table format */
  printf ("%-16s Frequency\n", "Keywords");
  for (int i = 0; i < totalKeywords; i++) {
    printf ("%-16s %d\n", keywords[i].keyword, keywords[i].frequency);
  }

Your basic code shows you are working hard to learn C the right way. That is the only way to learn C. Otherwise, you double the learning time working to break bad habits and learn what you should have learned the first time. Keep up the good work. The complete example is shown below:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FILENAME_LENGTH 256
// #define MAX_TEXT_LENGTH 100000
#define MAX_KEYWORDS 100
#define MAX_LINE MAX_KEYWORDS
#define MAX_CHARS 50

/* Structure to store keyword frequencies */
typedef struct KeywordFrequency {
  char keyword[MAX_CHARS];
  int frequency;
} KeywordFrequency;

/* qsort compare function for sorting keyword frequencies
 * using (a < b) - (a > b) for descending sort to avoid potential overflow.
 * for ascending sort use: (a > b) - (a < b)
 */
int compareKeywords(const void *a, const void *b) {
  const KeywordFrequency  *x = a,
                          *y = b;
  
  return (x->frequency < y->frequency) - (x->frequency > y->frequency);
}

int main() {
    
  char  filename[MAX_FILENAME_LENGTH] = "", /* initialize empty-string */
        text[MAX_LINE] = "";
  int totalKeywords = 0;
  FILE *textFile = NULL;
  /* initialize array of struct all zero */
  KeywordFrequency keywords[MAX_KEYWORDS] = {{.keyword = ""}};

  /* prompt for name of the text file */
  fputs ("Enter the name of the text file: ", stdout);
  /* validate every input */
  if (!fgets (filename, MAX_FILENAME_LENGTH, stdin)) {
    /* how do you know this occurred? */
    puts ("user canceled input with manual EOF");
    return 0;
  }
  
  /* trim '\n' from end of filename overwriting with nul-character */
  filename[strcspn (filename, "\n")] = 0;
  
  /* Open and read the text file */
  if (!(textFile = fopen(filename, "r"))) {
    perror ("textfile");
    return 1;
  }
  
  /* condition your read loop on your read function return */
  while (fgets (text, MAX_LINE, textFile)) {
    /* validate parse of information */
    if (sscanf (text, "%49s %d", keywords[totalKeywords].keyword,
                                 &keywords[totalKeywords].frequency) != 2) {
      /* handle invalid formatted line */
      fprintf (stderr, "error: bad format for index %d\n", totalKeywords);
      /* skip to next line in file */
      continue;
    }
    
    totalKeywords += 1;     /* increment index */
  }
  
  /* Sort keyword frequencies in descending order */
  qsort (keywords, totalKeywords, sizeof *keywords, compareKeywords);

  /* Print the results in a table format */
  printf ("%-16s Frequency\n", "Keywords");
  for (int i = 0; i < totalKeywords; i++) {
    printf ("%-16s %d\n", keywords[i].keyword, keywords[i].frequency);
  }
}

Example Use/Output

Compiling the code above (with Full Warnings Enabled as you compile all programs) and then running the program with your data file as input, you would get:

$ ./bin/wordfreq-read-sort
Enter the name of the text file: dat/wordfreq.txt
Keywords         Frequency
euery            8
common           8
gaue             7
thankes          5
vnkle            4
growes           3
wag              3
seal             3
day              3
soft             3

(note: you can also sort by frequency and then alphabetically if the frequencies are equal with only a couple additional lines of code)

Look things over and let me know if you have questions.


========= Edit - Per-Comment Needing to Find Word Frequency =========

Okay, so your file in your question was an example OUTPUT file, not an input file -- got it.

In short summary of the changes needed:

  • Your use of fscanf (fp, "%s", word) is a fine way to read one-word-at-a-time (with the appropriate field-width modifier).
  • Rather than allocating 100K of your structures, just allocate 2 to begin with and then realloc() when you fill up what you have allocated. With your 115 pages of text, a simple doubling of size each time you need to realloc() requires only 17 realloc() calls to accommodate the file. (it is a good balance between memory growth and the number of times you need to reallocate in most cases)
  • once you read a word, you need a function to check if it already exists in your collection of words, if it does, just increment the frequency by one, if not, check if you have space in your collection to add the word, if not realloc(), then add the word and set the frequency to 1.
  • Left to you:
    • the "%s" does not discriminate between "[The" and "The", so you should check and remove punctuation as needed. And trailing punctuation like "yours", "yours,", "yours.", "yours:" and "yours;" (all also in your input file). A simple function that strips leading and trailing punctuation from buf would work fine.
    • you have several very long hyphenated words to deal with, e.g. "Pastoricall-Comicall-Historicall-Pastorall" and "Tragicall-Comicall-Historicall-Pastorall". There is no clear-cut rule to handle hyphenated words. They are completely valid, and more so in the context of your input document.
    • one option is to keep a maximum word-length and then dynamically size your output spacing to accommodate the longest word in your list. You can do that using the '*' in the printf() format string providing the maximum word-length variable to specify the minimum field width for the keyword field.

With that premise, I can adapt your code fairly easily to do what is needed. (rather than adapt, I just rewrote it, I don't like camelCase variable names, and I try to keep the names short but descriptive -- because I don't like to type :)

You can do something similar to:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXWORDCHAR 64

typedef struct {
  char word[MAXWORDCHAR];
  size_t freq;
} wordfreq;

/* qsort compare function for sorting keyword frequencies
 * using (a < b) - (a > b) for descending sort to avoid potential overflow.
 * for ascending sort use: (a > b) - (a < b)
 */
int cmpwordfreq (const void *a, const void *b) {
  const wordfreq  *x = a,
                  *y = b;
#ifdef ALPHASORTEQUAL
  /* sort alphabetically if frequencies are equal */
  if (x->freq == y->freq) {
    return (strcmp (x->word, y->word));
  }
#endif
  /* sort by frequency */
  return (x->freq < y->freq) - (x->freq > y->freq);
}

/* check if word exists in allocated wordfreq structs
 * returns index of word if it exits, -1 otherwise.
 */
int wordexists (const char *word, const wordfreq *wf, size_t n)
{
  /* validate both word and wf not NULL */
  if (!word || !wf) {
    return -1;    /* no word in struct */
  }
  
  /* loop over each struct comparing word stored with word */
  for (size_t i = 0; i < n; i++) {
    if (strcmp (word, wf[i].word) == 0) {
      return i;   /* return index if found */
    }
  }
  
  return -1;      /* no word in struct */
}

int main (int argc, char **argv) {
  
  char buf[MAXWORDCHAR];    /* buffer to hold each word in file */
  wordfreq *wf = NULL;     /* pointer to allocated collection of struct */
  size_t  allocated = 0,    /* no. of structs allocated */
          used = 0;         /* no. of structs used */
  /* use filename provided as 1st argument (stdin by default) */
  FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

  if (!fp) {  /* validate file open for reading */
    perror ("file open failed");
    return 1;
  }
  
  /* loop reading each word in file */
  while (fscanf (fp, "%63s", buf) == 1) {
    int index = wordexists (buf, wf, used);   /* check if word in struct */
    if (index >= 0) {             /* if word exists */
      wf[index].freq += 1;       /* increment freq for word */
      continue;
    }
    /* if all storage used, realloc more */
    if (used == allocated) {
      /* always realloc to a temporary pointer */
      void *tmp = realloc (wf, (allocated ? 2 * allocated : 2) * sizeof *wf);
      /* validate realloc return */
      if (!tmp) {                       /* if realloc failed */
        perror ("realloc-wf-failed");   /* output error */
        break;                          /* break read loop */
      }
      wf = tmp;                                   /* assign realloced block */
      allocated = allocated ? allocated * 2 : 2;  /* update no. allocated */
    }
    
    strcpy (wf[used].word, buf);        /* copy word to struct */
    wf[used++].freq = 1;               /* set freq to 1, increment used */
  }
  
  if (fp != stdin)   /* close file if not stdin */
    fclose (fp);
  
  qsort (wf, used, sizeof *wf, cmpwordfreq);
  
  /* Print the results in a table format */
  printf ("%-20s Frequency\n", "Keywords");
  for (size_t i = 0; i < used; i++) {
    printf ("%-20s %zu\n", wf[i].word, wf[i].freq);
  }
  
  free (wf);    /* don't forget to free what you allocated */
}

NOTE: I added to preprocessor conditional of ALPHASORTEQUAL to allow you to alphabetically sort words with equal frequency. The preprocessor conditional just allows you to conditionally include or exclude code based on whether the definition is made. For example, doing nothing will exclude the code that sorts alphabetically, but adding the define (either in the code) or in your compile string, e.g. -DALPHASORTEQUAL for gcc/clang or /DALPHASORTEQUAL for VS will include the code.

Compiling with Full-Warnings Enabled

I touched on this in the original answer, but didn't give an example. Using gcc a compile string including -Wall -Wextra -pedantic -Wshadow will enable close to all warnings (definitely covering all learning to code in C cases). Add -Werror to treat warnings as errors (so can't cheat). Do not accept code until it compiles without warning. For the VS compiler (cl.exe) using /W3 will do.

When warnings occur, read and understand each warning -- then go fix it. The warnings will identify any problems, and the exact line on which they occur. You can learn a lot simply by listening to what your compiler is telling you. They have gotten very, very good at providing descriptive warnings in the past 20 years.

Example compile string I used with this code:

$ gcc -Wall -Wextra -pedantic -Wshadow -std=c11 -Ofast -o bin/word-freq word-freq.c

To enable the alphabetical sort if the word frequencies are equal, just define ALPHASORTEQUAL, e.g.

$ gcc -Wall -Wextra -pedantic -Wshadow -std=c11 -Ofast -DALPHASORTEQUAL -o bin/word-freq word-freq.c

(note: I put by data files in a dat/ directory under the current and my executables in a bin/ directory under the current to keep the C-source files in the current directly uncluttered with other things. That is why you see -o bin/... in the compile string and dat/... below for the filename to read)

Example Use/Output

NOTE: if you attempt to save the file from google-docs as Plain Text it is saved as with a BOM (Byte Order Mark) in DOS format. If on Linux run dos2unix filename.txt to convert it to Unix line-ending and remove the BOM.

With the converted file in dat/word-freq-main.txt, you simply pass it as the first command-line argument to the program, e.g.

$ ./bin/word-freq dat/word-freq-main.txt
Keywords             Frequency
the                  860
and                  605
to                   574
of                   571
I                    520
a                    435
my                   433
you                  402
Ham.                 337
in                   336
it                   285
is                   273
his                  264
And                  255
not                  254
your                 233
that                 227
with                 207
this                 195
me                   163
for                  159
be                   158
haue                 154
he                   149
but                  145
as                   134
The                  132
will                 122
our                  118
him                  116
That                 114
To                   109
so                   109
what                 108
are                  107
But                  103
shall                103
on                   100
King.                96
we                   96
Hor.                 95
by                   88
thou                 87
Lord                 83
no                   83
thy                  83
What                 82
from                 79
all                  78
her                  78
Lord,                73
or                   73
more                 70
Enter                69
at                   69
good                 69
was                  69
For                  68
Oh                   68
they                 68
do                   67
My                   66
like                 66
most                 65
It                   64
if                   63
A                    62
As                   62
Qu.                  62
Laer.                60
then                 59
Ile                  58
know                 58
Ophe.                56
let                  55
How                  54
may                  54
would                54
must                 52
should               50
Pol.                 49
hath                 49
an                   48
If                   47
You                  47
there                47
am                   46
did                  46
now                  46
their                46
come                 45
such                 45
them                 45
So                   44
very                 44
Rosin.               43
This                 43
make                 43
selfe                43
With                 42
some                 42
Hamlet               41
In                   41
Why                  41
much                 41
He                   40
King                 40
Let                  39
thee                 39
Polon.               38
giue                 38
had                  37
speake               37
tell                 37
too                  37
vpon                 37
vs                   37
'tis                 36
doe                  36
thinke               36
out                  35
Of                   34
mine                 34
one                  34
well                 34
Lord?                33
Then                 33
she                  33
these                33
which                33
see                  32
Mar.                 31
No                   31
O                    31
how                  31
it,                  31
We                   30
can                  30
say                  30
where                30
Clo.                 29
Or                   29
loue                 29
man                  29
you,                 29
King,                28
heere                28
him,                 28
owne                 28
go                   27
when                 27
'Tis                 26
Father               26
Hamlet,              26
cannot               26
could                26
might                26
time                 26
yet                  26
&                    25
th'                  25
When                 24
heare                24
made                 24
thus                 24
vp                   24
were                 24
Is                   23
Sir,                 23
doth                 23
Come                 22
Exeunt.              22
Good                 22
I,                   22
Now                  22
call                 22
into                 22
me,                  22
take                 22
Hamlet.              21
Where                21
Who                  21
hast                 21
put                  21
Giue                 20
Guild.               20
Which                20
Your                 20
hold                 20
set                  20
His                  19
great                19
him.                 19
i'th'                19
leaue                19
two                  19
whose                19
Do                   18
Exit                 18
Osr.                 18
They                 18
both                 18
comes                18
nor                  18
not,                 18
poore                18
pray                 18
sweet                18
thing                18
those                18
Fathers              17
Haue                 17
Heauen               17
Horatio,             17
Mother               17
Queene               17
Will                 17
against              17
any                  17
dead                 17
neuer                17
shew                 17
vs,                  17
Be                   16
come,                16
deere                16
does                 16
though               16
By                   15
God                  15
Laertes,             15
Nay,                 15
againe               15
death,               15
head                 15
looke                15
meanes               15
old                  15
so,                  15
you:                 15
Kin.                 14
No,                  14
Noble                14
Nor                  14
Not                  14
Thou                 14
about                14
bee                  14
hee                  14
keepe                14
many                 14
matter               14
me.                  14
nothing              14
seene                14
selfe,               14
this?                14
world                14
Enter.               13
Looke                13
Lord.                13
Players              13
Reynol.              13
There                13
There's              13
Vpon                 13
beleeue              13
euen                 13
faire                13
finde                13
goe                  13
ha's                 13
it.                  13
liue                 13
makes                13
night                13
off                  13
same                 13
sent                 13
who                  13
you?                 13
At                   12
Horatio              12
Laertes              12
Most                 12
Mother,              12
Other.               12
Our                  12
Queene,              12
Sir                  12
Would                12
beare                12
better               12
betweene             12
farre                12
hand                 12
part                 12
saw                  12
this,                12
true                 12
words                12
Come,                11
Father,              11
Heauen,              11
Ophelia,             11
beene                11
day                  11
death                11
downe                11
eare                 11
ere                  11
euer                 11
feare                11
first                11
life                 11
life,                11
little               11
long                 11
once                 11
other                11
play                 11
vse                  11
well,                11
why                  11
Are                  10
Barn.                10
Denmarke             10
Did                  10
England              10
Ghost.               10
Guil.                10
Nay                  10
Soule                10
Take                 10
Till                 10
Yet                  10
be,                  10
before               10
done                 10
downe,               10
end                  10
fit                  10
foule                10
him:                 10
is't                 10
is,                  10
me?                  10
now,                 10
o're                 10
sir,                 10
stand                10
then,                10
thoughts             10
<snip -- many more>

Memory Use/Error Check

In any code you write that dynamically allocates memory, you have 2 responsibilities regarding any block of memory allocated: (1) always preserve a pointer to the starting address for the block of memory so, (2) it can be freed when it is no longer needed.

It is imperative that you use a memory error checking program to ensure you do not attempt to access memory or write beyond/outside the bounds of your allocated block, attempt to read or base a conditional jump on an uninitialized value, and finally, to confirm that you free all the memory you have allocated.

For Linux valgrind is the normal choice. There are similar memory checkers for every platform. They are all simple to use, just run your program through it.

$ valgrind ./bin/word-freq dat/word-freq-main.txt
<snip output>
yours                1
yours,               1
yours.               1
yours:               1
yours;               1
youth:               1
==14914==
==14914== HEAP SUMMARY:
==14914==     in use at exit: 0 bytes in 0 blocks
==14914==   total heap usage: 17 allocs, 17 frees, 1,303,920 bytes allocated
==14914==
==14914== All heap blocks were freed -- no leaks are possible
==14914==
==14914== For lists of detected and suppressed errors, rerun with: -s
==14914== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Always confirm that you have freed all memory you have allocated and that there are no memory errors.

Let me know if you have further questions.

Upvotes: 3

Some programmer dude
Some programmer dude

Reputation: 409216

Two things:

The first is that the scanf format %s reads space delimited words. So there's nothing to tokenize. You never even read anything into text which you attempt to tokenize;

The second thing is that you read every single word of the input file into a separate "keyword" element, which is not what you should do. And will go out of bounds of the array if there's more than 100 words in the input file.

Instead you should read into another variable and search the current keywords, and find if it's in the array or not, and if it is then increase the frequency, else add it.

Upvotes: 1

Related Questions