Reputation: 21
hope everyone is well! im a bit stuck on my programming assignment, the purpose of the c code is to compare a txt file (name given as user input) to another keywords txt file. then record the frequency of each word and print them in descending order. This should look like this:
euery 8
common 8
gaue 7
thankes 5
vnkle 4
growes 3
wag 3
seal 3
day 3
soft 3
sadly this isnt what my code produces. the code is as shown below:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_FILENAME_LENGTH 256
#define MAX_TEXT_LENGTH 100000
#define MAX_KEYWORDS 100
// Structure to store keyword frequencies
typedef struct KeywordFrequency {
char keyword[50];
int frequency;
} KeywordFrequency;
// Compare function for sorting keyword frequencies
int compareKeywords(const void *a, const void *b) {
return ((KeywordFrequency *)b)->frequency - ((KeywordFrequency *)a)->frequency;
}
int main() {
char filename[MAX_FILENAME_LENGTH];
char text[MAX_TEXT_LENGTH];
KeywordFrequency keywords[MAX_KEYWORDS];
int totalKeywords = 0;
// Ask the user for the name of the text file
printf("Enter the name of the text file: ");
scanf("%s", filename);
// Open and read the text file
FILE *textFile = fopen(filename, "r");
if (textFile == NULL) {
printf("Error opening %s\n", filename);
return 1;
}
// Read the content of the text file
fread(text, 1, sizeof(text), textFile);
fclose(textFile);
// Open and read the keyword file
FILE *keywordFile = fopen("keyword.txt", "r");
if (keywordFile == NULL) {
printf("Error opening keyword.txt\n");
return 1;
}
// Initialize keyword frequencies
while (fscanf(keywordFile, "%s", keywords[totalKeywords].keyword) != EOF) {
keywords[totalKeywords].frequency = 0;
totalKeywords++;
}
fclose(keywordFile);
// Tokenize and compare the text with keywords
char *token = strtok(text, " ");
while (token != NULL) {
for (int i = 0; i < totalKeywords; i++) {
if (strcmp(token, keywords[i].keyword) == 0) {
keywords[i].frequency++;
}
}
token = strtok(NULL, " ");
}
// Sort keyword frequencies in descending order
qsort(keywords, totalKeywords, sizeof(KeywordFrequency), compareKeywords);
// Print the results in a table format
printf("Keyword\tFrequency\n");
for (int i = 0; i < totalKeywords; i++) {
printf("%s\t%d\n", keywords[i].keyword, keywords[i].frequency);
}
return 0;
}
any help would be greatly appreciated
Upvotes: 2
Views: 151
Reputation: 84569
As @someprogrammerdude points out in his answer, using "%s"
will only read a single whitespace separated word. In order to read an entire line of text into a sufficiently sized buffer, fgets()
is the preferred function. (in fact it is the preferred manner of reading text or user-input in just about all cases). You then parse (separate) the values you need from the buffer filled by fgets()
with sscanf()
. This provides several benefits over using fscanf()
directly. It decouples the read and parse and allows separate validation of each step.
Continuing from my comments, don't use Magic-Numbers. 50
in your struct definition is a Magic-Number. You do a good job declaring constants with #define
, just define a constant for the keyword
member array as well. You also do a great job Not Skimping On Buffer Size with MAX_TEXT_LENGTH
, but you can tailor that a bit more narrowly for the file at hand. Generally lines are no longer than 1K or so. 100K certainly covers that, but on embedded or memory limited systems, that may be a bit much for a read buffer. For the lines here, 100 bytes is fine. (that's still 10X the longest line) So the following is fine:
#define MAX_FILENAME_LENGTH 256
// #define MAX_TEXT_LENGTH 100000
#define MAX_KEYWORDS 100
#define MAX_LINE MAX_KEYWORDS
#define MAX_CHARS 50
(note: limits.h
provides the PATH_MAX
define giving the maximum pathname length for the system which is a convenient way to size your filename
array)
Now, to eliminate your Magic-Number, simply define your KeywordFrequency
struct as:
/* Structure to store keyword frequencies */
typedef struct KeywordFrequency {
char keyword[MAX_CHARS];
int frequency;
} KeywordFrequency;
While you variable declarations are fine, it is a good habit to initialize all variables and arrays (especially when learning C) to prevent an inadvertent access of a variable with automatic storage duration while its value is indeterminate (a/k/a Undefined Behavor) If possible, by defining your variables at the start of the scope you ensure your code is portable to all C language standards. With that, a slight rearranging could be:
char filename[MAX_FILENAME_LENGTH] = "", /* initialize empty-string */
text[MAX_LINE] = "";
int totalKeywords = 0;
FILE *textFile = NULL;
/* initialize array of struct all zero */
KeywordFrequency keywords[MAX_KEYWORDS] = {{.keyword = ""}};
All users input should be taken with fgets()
to ensure an entire line at a time is read (including the '\n'
generated by the user pressing the Enter key). That way, there is never any question whether additional character remain in your input stream (e.g. stdin
) unread -- just waiting to bite you on your next read attempt. To remove the '\n'
read and included in the buffer filled by fgets()
, simply use strcspn()
as shown below. Your input can be taken as:
/* prompt for name of the text file */
fputs ("Enter the name of the text file: ", stdout);
/* validate every input */
if (!fgets (filename, MAX_FILENAME_LENGTH, stdin)) {
/* how do you know this occurred? */
puts ("user canceled input with manual EOF");
return 0;
}
/* trim '\n' from end of filename overwriting with nul-character */
filename[strcspn (filename, "\n")] = 0;
Now open your file and enter you read-loop reading the contents of your file into your array of struct KeywordFrequency
. Note: you always want to condition your read-loop on the return of your read-function. Here you can do that as:
/* Open and read the text file */
if (!(textFile = fopen(filename, "r"))) {
perror ("textfile");
return 1;
}
/* condition your read loop on your read function return */
while (fgets (text, MAX_LINE, textFile)) {
/* validate parse of information */
if (sscanf (text, "%49s %d", keywords[totalKeywords].keyword,
&keywords[totalKeywords].frequency) != 2) {
/* handle invalid formatted line */
fprintf (stderr, "error: bad format for index %d\n", totalKeywords);
/* skip to next line in file */
continue;
}
totalKeywords += 1; /* increment index */
}
(note: the format-width modifier used with "%49s"
to protect the bounds of the keyword
array. You cannot use a defined constant in the format-string, so you just have to manually enter length-1
to ensure space for the nul-terminating character. And, no, this isn't considered a Magic-Number, it's just the way C works. But it is a Magic-Number in the sense it will save you from UB and a potential buffer overrun exploit...)
All that remains is sorting with qsort()
. Per my comment, you want to rearrange your compare()
function so it returns the result of a conditional expression rather than a subtraction. Why? You can easily overflow subtracting two large negative numbers. Using a comparison avoids that potential, e.g.
/* qsort compare function for sorting keyword frequencies
* using (a < b) - (a > b) for descending sort to avoid potential overflow.
* for ascending sort use: (a > b) - (a < b)
*/
int compareKeywords(const void *a, const void *b) {
const KeywordFrequency *x = a,
*y = b;
return (x->frequency < y->frequency) - (x->frequency > y->frequency);
}
...
/* Sort keyword frequencies in descending order */
qsort (keywords, totalKeywords, sizeof *keywords, compareKeywords);
/* Print the results in a table format */
printf ("%-16s Frequency\n", "Keywords");
for (int i = 0; i < totalKeywords; i++) {
printf ("%-16s %d\n", keywords[i].keyword, keywords[i].frequency);
}
Your basic code shows you are working hard to learn C the right way. That is the only way to learn C. Otherwise, you double the learning time working to break bad habits and learn what you should have learned the first time. Keep up the good work. The complete example is shown below:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_FILENAME_LENGTH 256
// #define MAX_TEXT_LENGTH 100000
#define MAX_KEYWORDS 100
#define MAX_LINE MAX_KEYWORDS
#define MAX_CHARS 50
/* Structure to store keyword frequencies */
typedef struct KeywordFrequency {
char keyword[MAX_CHARS];
int frequency;
} KeywordFrequency;
/* qsort compare function for sorting keyword frequencies
* using (a < b) - (a > b) for descending sort to avoid potential overflow.
* for ascending sort use: (a > b) - (a < b)
*/
int compareKeywords(const void *a, const void *b) {
const KeywordFrequency *x = a,
*y = b;
return (x->frequency < y->frequency) - (x->frequency > y->frequency);
}
int main() {
char filename[MAX_FILENAME_LENGTH] = "", /* initialize empty-string */
text[MAX_LINE] = "";
int totalKeywords = 0;
FILE *textFile = NULL;
/* initialize array of struct all zero */
KeywordFrequency keywords[MAX_KEYWORDS] = {{.keyword = ""}};
/* prompt for name of the text file */
fputs ("Enter the name of the text file: ", stdout);
/* validate every input */
if (!fgets (filename, MAX_FILENAME_LENGTH, stdin)) {
/* how do you know this occurred? */
puts ("user canceled input with manual EOF");
return 0;
}
/* trim '\n' from end of filename overwriting with nul-character */
filename[strcspn (filename, "\n")] = 0;
/* Open and read the text file */
if (!(textFile = fopen(filename, "r"))) {
perror ("textfile");
return 1;
}
/* condition your read loop on your read function return */
while (fgets (text, MAX_LINE, textFile)) {
/* validate parse of information */
if (sscanf (text, "%49s %d", keywords[totalKeywords].keyword,
&keywords[totalKeywords].frequency) != 2) {
/* handle invalid formatted line */
fprintf (stderr, "error: bad format for index %d\n", totalKeywords);
/* skip to next line in file */
continue;
}
totalKeywords += 1; /* increment index */
}
/* Sort keyword frequencies in descending order */
qsort (keywords, totalKeywords, sizeof *keywords, compareKeywords);
/* Print the results in a table format */
printf ("%-16s Frequency\n", "Keywords");
for (int i = 0; i < totalKeywords; i++) {
printf ("%-16s %d\n", keywords[i].keyword, keywords[i].frequency);
}
}
Example Use/Output
Compiling the code above (with Full Warnings Enabled as you compile all programs) and then running the program with your data file as input, you would get:
$ ./bin/wordfreq-read-sort
Enter the name of the text file: dat/wordfreq.txt
Keywords Frequency
euery 8
common 8
gaue 7
thankes 5
vnkle 4
growes 3
wag 3
seal 3
day 3
soft 3
(note: you can also sort by frequency and then alphabetically if the frequencies are equal with only a couple additional lines of code)
Look things over and let me know if you have questions.
========= Edit - Per-Comment Needing to Find Word Frequency =========
Okay, so your file in your question was an example OUTPUT file, not an input file -- got it.
In short summary of the changes needed:
fscanf (fp, "%s", word)
is a fine way to read one-word-at-a-time (with the appropriate field-width modifier).realloc()
when you fill up what you have allocated. With your 115 pages of text, a simple doubling of size each time you need to realloc()
requires only 17 realloc()
calls to accommodate the file. (it is a good balance between memory growth and the number of times you need to reallocate in most cases)realloc()
, then add the word and set the frequency to 1."%s"
does not discriminate between "[The"
and "The"
, so you should check and remove punctuation as needed. And trailing punctuation like "yours"
, "yours,"
, "yours."
, "yours:"
and "yours;"
(all also in your input file). A simple function that strips leading and trailing punctuation from buf
would work fine."Pastoricall-Comicall-Historicall-Pastorall"
and "Tragicall-Comicall-Historicall-Pastorall"
. There is no clear-cut rule to handle hyphenated words. They are completely valid, and more so in the context of your input document.'*'
in the printf()
format string providing the maximum word-length variable to specify the minimum field width for the keyword field.With that premise, I can adapt your code fairly easily to do what is needed. (rather than adapt, I just rewrote it, I don't like camelCase variable names, and I try to keep the names short but descriptive -- because I don't like to type :)
You can do something similar to:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAXWORDCHAR 64
typedef struct {
char word[MAXWORDCHAR];
size_t freq;
} wordfreq;
/* qsort compare function for sorting keyword frequencies
* using (a < b) - (a > b) for descending sort to avoid potential overflow.
* for ascending sort use: (a > b) - (a < b)
*/
int cmpwordfreq (const void *a, const void *b) {
const wordfreq *x = a,
*y = b;
#ifdef ALPHASORTEQUAL
/* sort alphabetically if frequencies are equal */
if (x->freq == y->freq) {
return (strcmp (x->word, y->word));
}
#endif
/* sort by frequency */
return (x->freq < y->freq) - (x->freq > y->freq);
}
/* check if word exists in allocated wordfreq structs
* returns index of word if it exits, -1 otherwise.
*/
int wordexists (const char *word, const wordfreq *wf, size_t n)
{
/* validate both word and wf not NULL */
if (!word || !wf) {
return -1; /* no word in struct */
}
/* loop over each struct comparing word stored with word */
for (size_t i = 0; i < n; i++) {
if (strcmp (word, wf[i].word) == 0) {
return i; /* return index if found */
}
}
return -1; /* no word in struct */
}
int main (int argc, char **argv) {
char buf[MAXWORDCHAR]; /* buffer to hold each word in file */
wordfreq *wf = NULL; /* pointer to allocated collection of struct */
size_t allocated = 0, /* no. of structs allocated */
used = 0; /* no. of structs used */
/* use filename provided as 1st argument (stdin by default) */
FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;
if (!fp) { /* validate file open for reading */
perror ("file open failed");
return 1;
}
/* loop reading each word in file */
while (fscanf (fp, "%63s", buf) == 1) {
int index = wordexists (buf, wf, used); /* check if word in struct */
if (index >= 0) { /* if word exists */
wf[index].freq += 1; /* increment freq for word */
continue;
}
/* if all storage used, realloc more */
if (used == allocated) {
/* always realloc to a temporary pointer */
void *tmp = realloc (wf, (allocated ? 2 * allocated : 2) * sizeof *wf);
/* validate realloc return */
if (!tmp) { /* if realloc failed */
perror ("realloc-wf-failed"); /* output error */
break; /* break read loop */
}
wf = tmp; /* assign realloced block */
allocated = allocated ? allocated * 2 : 2; /* update no. allocated */
}
strcpy (wf[used].word, buf); /* copy word to struct */
wf[used++].freq = 1; /* set freq to 1, increment used */
}
if (fp != stdin) /* close file if not stdin */
fclose (fp);
qsort (wf, used, sizeof *wf, cmpwordfreq);
/* Print the results in a table format */
printf ("%-20s Frequency\n", "Keywords");
for (size_t i = 0; i < used; i++) {
printf ("%-20s %zu\n", wf[i].word, wf[i].freq);
}
free (wf); /* don't forget to free what you allocated */
}
NOTE: I added to preprocessor conditional of ALPHASORTEQUAL
to allow you to alphabetically sort words with equal frequency. The preprocessor conditional just allows you to conditionally include or exclude code based on whether the definition is made. For example, doing nothing will exclude the code that sorts alphabetically, but adding the define (either in the code) or in your compile string, e.g. -DALPHASORTEQUAL
for gcc/clang or /DALPHASORTEQUAL
for VS will include the code.
Compiling with Full-Warnings Enabled
I touched on this in the original answer, but didn't give an example. Using gcc
a compile string including -Wall -Wextra -pedantic -Wshadow
will enable close to all warnings (definitely covering all learning to code in C cases). Add -Werror
to treat warnings as errors (so can't cheat). Do not accept code until it compiles without warning. For the VS compiler (cl.exe
) using /W3
will do.
When warnings occur, read and understand each warning -- then go fix it. The warnings will identify any problems, and the exact line on which they occur. You can learn a lot simply by listening to what your compiler is telling you. They have gotten very, very good at providing descriptive warnings in the past 20 years.
Example compile string I used with this code:
$ gcc -Wall -Wextra -pedantic -Wshadow -std=c11 -Ofast -o bin/word-freq word-freq.c
To enable the alphabetical sort if the word frequencies are equal, just define ALPHASORTEQUAL
, e.g.
$ gcc -Wall -Wextra -pedantic -Wshadow -std=c11 -Ofast -DALPHASORTEQUAL -o bin/word-freq word-freq.c
(note: I put by data files in a dat/
directory under the current and my executables in a bin/
directory under the current to keep the C-source files in the current directly uncluttered with other things. That is why you see -o bin/...
in the compile string and dat/...
below for the filename to read)
Example Use/Output
NOTE: if you attempt to save the file from google-docs as Plain Text it is saved as with a BOM (Byte Order Mark) in DOS format. If on Linux run dos2unix filename.txt
to convert it to Unix line-ending and remove the BOM.
With the converted file in dat/word-freq-main.txt
, you simply pass it as the first command-line argument to the program, e.g.
$ ./bin/word-freq dat/word-freq-main.txt
Keywords Frequency
the 860
and 605
to 574
of 571
I 520
a 435
my 433
you 402
Ham. 337
in 336
it 285
is 273
his 264
And 255
not 254
your 233
that 227
with 207
this 195
me 163
for 159
be 158
haue 154
he 149
but 145
as 134
The 132
will 122
our 118
him 116
That 114
To 109
so 109
what 108
are 107
But 103
shall 103
on 100
King. 96
we 96
Hor. 95
by 88
thou 87
Lord 83
no 83
thy 83
What 82
from 79
all 78
her 78
Lord, 73
or 73
more 70
Enter 69
at 69
good 69
was 69
For 68
Oh 68
they 68
do 67
My 66
like 66
most 65
It 64
if 63
A 62
As 62
Qu. 62
Laer. 60
then 59
Ile 58
know 58
Ophe. 56
let 55
How 54
may 54
would 54
must 52
should 50
Pol. 49
hath 49
an 48
If 47
You 47
there 47
am 46
did 46
now 46
their 46
come 45
such 45
them 45
So 44
very 44
Rosin. 43
This 43
make 43
selfe 43
With 42
some 42
Hamlet 41
In 41
Why 41
much 41
He 40
King 40
Let 39
thee 39
Polon. 38
giue 38
had 37
speake 37
tell 37
too 37
vpon 37
vs 37
'tis 36
doe 36
thinke 36
out 35
Of 34
mine 34
one 34
well 34
Lord? 33
Then 33
she 33
these 33
which 33
see 32
Mar. 31
No 31
O 31
how 31
it, 31
We 30
can 30
say 30
where 30
Clo. 29
Or 29
loue 29
man 29
you, 29
King, 28
heere 28
him, 28
owne 28
go 27
when 27
'Tis 26
Father 26
Hamlet, 26
cannot 26
could 26
might 26
time 26
yet 26
& 25
th' 25
When 24
heare 24
made 24
thus 24
vp 24
were 24
Is 23
Sir, 23
doth 23
Come 22
Exeunt. 22
Good 22
I, 22
Now 22
call 22
into 22
me, 22
take 22
Hamlet. 21
Where 21
Who 21
hast 21
put 21
Giue 20
Guild. 20
Which 20
Your 20
hold 20
set 20
His 19
great 19
him. 19
i'th' 19
leaue 19
two 19
whose 19
Do 18
Exit 18
Osr. 18
They 18
both 18
comes 18
nor 18
not, 18
poore 18
pray 18
sweet 18
thing 18
those 18
Fathers 17
Haue 17
Heauen 17
Horatio, 17
Mother 17
Queene 17
Will 17
against 17
any 17
dead 17
neuer 17
shew 17
vs, 17
Be 16
come, 16
deere 16
does 16
though 16
By 15
God 15
Laertes, 15
Nay, 15
againe 15
death, 15
head 15
looke 15
meanes 15
old 15
so, 15
you: 15
Kin. 14
No, 14
Noble 14
Nor 14
Not 14
Thou 14
about 14
bee 14
hee 14
keepe 14
many 14
matter 14
me. 14
nothing 14
seene 14
selfe, 14
this? 14
world 14
Enter. 13
Looke 13
Lord. 13
Players 13
Reynol. 13
There 13
There's 13
Vpon 13
beleeue 13
euen 13
faire 13
finde 13
goe 13
ha's 13
it. 13
liue 13
makes 13
night 13
off 13
same 13
sent 13
who 13
you? 13
At 12
Horatio 12
Laertes 12
Most 12
Mother, 12
Other. 12
Our 12
Queene, 12
Sir 12
Would 12
beare 12
better 12
betweene 12
farre 12
hand 12
part 12
saw 12
this, 12
true 12
words 12
Come, 11
Father, 11
Heauen, 11
Ophelia, 11
beene 11
day 11
death 11
downe 11
eare 11
ere 11
euer 11
feare 11
first 11
life 11
life, 11
little 11
long 11
once 11
other 11
play 11
vse 11
well, 11
why 11
Are 10
Barn. 10
Denmarke 10
Did 10
England 10
Ghost. 10
Guil. 10
Nay 10
Soule 10
Take 10
Till 10
Yet 10
be, 10
before 10
done 10
downe, 10
end 10
fit 10
foule 10
him: 10
is't 10
is, 10
me? 10
now, 10
o're 10
sir, 10
stand 10
then, 10
thoughts 10
<snip -- many more>
Memory Use/Error Check
In any code you write that dynamically allocates memory, you have 2 responsibilities regarding any block of memory allocated: (1) always preserve a pointer to the starting address for the block of memory so, (2) it can be freed when it is no longer needed.
It is imperative that you use a memory error checking program to ensure you do not attempt to access memory or write beyond/outside the bounds of your allocated block, attempt to read or base a conditional jump on an uninitialized value, and finally, to confirm that you free all the memory you have allocated.
For Linux valgrind
is the normal choice. There are similar memory checkers for every platform. They are all simple to use, just run your program through it.
$ valgrind ./bin/word-freq dat/word-freq-main.txt
<snip output>
yours 1
yours, 1
yours. 1
yours: 1
yours; 1
youth: 1
==14914==
==14914== HEAP SUMMARY:
==14914== in use at exit: 0 bytes in 0 blocks
==14914== total heap usage: 17 allocs, 17 frees, 1,303,920 bytes allocated
==14914==
==14914== All heap blocks were freed -- no leaks are possible
==14914==
==14914== For lists of detected and suppressed errors, rerun with: -s
==14914== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Always confirm that you have freed all memory you have allocated and that there are no memory errors.
Let me know if you have further questions.
Upvotes: 3
Reputation: 409216
Two things:
The first is that the scanf
format %s
reads space delimited words. So there's nothing to tokenize. You never even read anything into text
which you attempt to tokenize;
The second thing is that you read every single word of the input file into a separate "keyword" element, which is not what you should do. And will go out of bounds of the array if there's more than 100 words in the input file.
Instead you should read into another variable and search the current keywords, and find if it's in the array or not, and if it is then increase the frequency, else add it.
Upvotes: 1