theprogrammer

Reputation: 2029

Why is my wc implementation giving wrong word count?

Here is a small code snippet.

 while((c = fgetc(fp)) != -1)
    {
        cCount++; // character count
        if(c == '\n') lCount++; // line count
        else 
        {
            if(c == ' ' && prevC != ' ') wCount++; // word count
        }
        prevC = c; // previous character equals current character. Think of it as memory.
    }

When I run wc on a file containing the snippet above (as is), I get 48 words, but when I run my program on the same input, I get 59 words.

How can I calculate the word count exactly like wc does?

Upvotes: 0

Views: 3186

Answers (4)

H.S.

Reputation: 12679

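You can track whether you are currently inside a word, and count a word only when a non-whitespace character follows whitespace: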

#include <stdio.h>

int count(void)
{
    unsigned int cCount = 0, wCount = 0, lCount = 0;
    int incr_word_count = 0; // 1 while inside a word, 0 otherwise
    int c; // int rather than char, so EOF stays distinct from every valid character
    FILE *fp = fopen("text", "r");

    if (fp == NULL)
    {
        printf("Failed to open file\n");
        return -1;
    }

    while ((c = fgetc(fp)) != EOF)
    {
        cCount++; // character count
        if (c == '\n') lCount++; // line count
        if (c == ' ' || c == '\n' || c == '\t')
            incr_word_count = 0; // whitespace ends the current word
        else if (incr_word_count == 0) {
            incr_word_count = 1;
            wCount++; // first character of a new word
        }
    }
    fclose(fp);
    printf("line : %u\n", lCount);
    printf("word : %u\n", wCount);
    printf("char : %u\n", cCount);
    return 0;
}

Upvotes: 0

Chatz

Reputation: 56

There is an example of the function you want in the book "The C Programming Language" by Brian W. Kernighan and Dennis M. Ritchie. As the authors say: "This is a bare-bones version of the UNIX program wc." Altered to count only words, it looks like this:

#include <stdio.h>

#define IN 1 /* inside a word */
#define OUT 0 /* outside a word */

/* nw counts words in input */
int main(void)
{
  int c, nw, state;
  state = OUT;
  nw = 0;
  while ((c = getchar()) != EOF) {
    if (c == ' ' || c == '\n' || c == '\t')
      state = OUT;
    else if (state == OUT) {
      state = IN;
      ++nw;
    }
  }
  printf("%d\n", nw);
  return 0;
}

Upvotes: 1

Kevin

Reputation: 7324

You are treating anything that isn't a space as a valid word. This means that a newline followed by a space is a word, and since your input (which is your code snippet) is indented you get a bunch of extra words.

You should use isspace to check for whitespace instead of comparing the character to ' ':

while((c = fgetc(fp)) != EOF)
{
    cCount++;
    if (c == '\n')
        lCount++;
    if (isspace(c) && !isspace(prevC))
        wCount++;
    prevC = c;
}
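
To make the difference concrete, here is a minimal sketch that applies both conditions to an in-memory string instead of a file. The function names count_space_only and count_isspace are made up for this illustration, and both start prev as a space, which is an assumption since the question does not show how prevC is initialized. The isspace version also counts a final word that is not followed by whitespace, which the loop above would miss on its own.

```c
#include <ctype.h>

/* Condition from the question: count a ' ' that follows any non-' ' character. */
unsigned count_space_only(const char *s)
{
    unsigned words = 0;
    int prev = ' ';                 /* assumption: behave as if input starts after a space */
    for (; *s != '\0'; s++) {
        int c = (unsigned char)*s;
        if (c == ' ' && prev != ' ')
            words++;                /* also fires for a newline followed by a space */
        prev = c;
    }
    return words;
}

/* isspace-based condition: count whitespace that follows non-whitespace. */
unsigned count_isspace(const char *s)
{
    unsigned words = 0;
    int prev = ' ';                 /* same assumption as above */
    for (; *s != '\0'; s++) {
        int c = (unsigned char)*s;
        if (isspace(c) && !isspace(prev))
            words++;                /* a word just ended */
        prev = c;
    }
    if (!isspace(prev))
        words++;                    /* last word had no trailing whitespace */
    return words;
}
```

For the input "one\n \n \n" (one real word followed by two lines holding a single space), count_space_only returns 2 while count_isspace returns 1. For "one\ntwo\nthree\n", count_space_only returns 0 because no word is followed by a space, while count_isspace returns 3.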

Upvotes: 1

Sreedev Shibu

Reputation: 136

Instead of checking only for spaces, you should also check for other whitespace characters such as \t and \n.

You can use isspace() from <ctype.h> for this.

Change the line

if(c == ' ' && prevC != ' ') wCount++;

to

if(isspace(c) && !isspace(prevC)) wCount++;

This should give the correct results. Don't forget to include <ctype.h>.
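
As a quick check of what isspace() covers: in the default "C" locale it is true for space, \t, \n, \v, \f and \r. The helper below, count_whitespace, is a hypothetical name used only for this sketch. It counts how many characters of a string isspace() accepts. The cast to unsigned char matters when the argument comes from a char, because passing a negative value other than EOF to isspace() is undefined behavior (values returned by fgetc are already safe to pass directly).

```c
#include <ctype.h>

/* Hypothetical helper: count the characters in s that isspace() treats as whitespace. */
unsigned count_whitespace(const char *s)
{
    unsigned n = 0;
    for (; *s != '\0'; s++) {
        if (isspace((unsigned char)*s))   /* cast avoids negative char values */
            n++;
    }
    return n;
}
```

With this, count_whitespace(" \t\n\v\f\r") returns 6 and count_whitespace("a b\tc\n") returns 3, so a single isspace() test replaces the chain of comparisons against ' ', '\t' and '\n'.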

Upvotes: 0
