Reputation: 2890
How could I go about keeping track of the number of times a word appears in a text file? I would like to do this for every word.
For example, if the input is something like:
"the man said hi to the boy."
Each of "man said hi to boy" would have an occurrence of 1.
"the" would have an occurrence of 2.
I was thinking of keeping a dictionary with word/occurrence pairs but I'm not sure how to implement this in C. A link to any similar or related problems with a solution would be great.
EDIT: To avoid rolling my own hash table I decided to learn how to use GLib. Along the way I found an excellent tutorial which walks through a similar problem. http://bo.majewski.name/bluear/gnu/GLib/ch03s03.html
I am awestruck by the number of different approaches, and in particular the simplicity and elegance of the Ruby implementation.
Upvotes: 6
Views: 17087
Reputation:
#include <iostream>
#include <fstream>
#include <cstdlib>
using namespace std;
struct stdt
{
    char name[20] ;
    int id ;
}; //std
int main()
{
    stdt boy ;
    int a = 0 ;
    ofstream TextFile ;
    cout << "Begin File Creation \n" ;
    TextFile.open("F:\\C++ Book Chapter Program\\Ch 7\\File.txt", ios::binary );
    if ( !TextFile)
    {
        cerr << "Error 100 Opening File.txt" ;
        exit(100);
    }//end if
    while ( a < 3 )
    {
        cout << "\nEnter Name : " ;
        cin >> boy.name;
        cout << "\nEnter ID : " ;
        cin >> boy.id ;
        TextFile.write( (char*) &boy , sizeof (boy) ) ; // write after reading, so all three records reach the file
        a++;
    }//end while
    TextFile.close();
    cout << "\nEnd File Creation" ;
    ifstream TextFile1 ;
    TextFile1.open("F:\\C++ Book Chapter Program\\Ch 7\\File.txt", ios::binary );
    while ( TextFile1.read( (char*) &boy , sizeof (boy) ) )
    {
        cout << "\nName : " << boy.name;
        cout << "\nID : " << boy.id ;
    }// end While
    cin.get();
    return 0 ;
}//end main
Upvotes: 0
Reputation: 10395
in Perl:
my %wordcount = ();
while(<>){map {$wordcount{$_}++} (split /\s+/)}
print "$_ = $wordcount{$_}\n" foreach sort keys %wordcount;
and in Perl Golf (just for fun):
my%w;
map{$w{$_}++}split/\s+/while(<>);
print"$_=$w{$_}\n"foreach keys%w;
Upvotes: 1
Reputation:
For an individual word, there's no need to write a program at all unless this is part of something larger:
sed -e 's/[[:space:]]/\n/g' < file.txt | grep -cx WORD
(GNU sed is assumed for the \n in the replacement; grep's -x matches the whole line, so counting "the" doesn't also count "then".)
Upvotes: 2
Reputation: 754080
Does this count?
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
char buffer[2048];
if (argc != 2)
{
fprintf(stderr, "Usage: %s file\n", argv[0]);
exit(EXIT_FAILURE);
}
snprintf(buffer, sizeof(buffer), "tr -cs '[a-z][A-Z]' '[\\n*]' < %s |"
" sort | uniq -c | sort -n", argv[1]);
return(system(buffer));
}
It basically encapsulates the canonical Unix pipeline for counting words as a shell command. The 'tr' command translates anything that isn't an alphabetic character into a newline and squeezes out duplicates. The first 'sort' groups all occurrences of each word together. 'uniq -c' counts the number of consecutive appearances of each word, printing the word and its count. The second 'sort' puts them in order of increasing repeats. You might need to fiddle with the options to 'tr'; it isn't the most stable command from system to system, and it routinely requires manual tweaking. On Solaris 10 using /usr/bin/tr, the code above produces (on its own source):
1
1 A
1 EXIT
1 FAILURE
1 Usage
1 Z
1 a
1 c
1 cs
1 exit
1 file
1 fprintf
1 if
1 main
1 return
1 sizeof
1 snprintf
1 stderr
1 stdio
1 stdlib
1 system
1 tr
1 uniq
1 z
2 argc
2 char
2 h
2 include
2 int
2 s
2 sort
3 argv
3 n
4 buffer
Upvotes: 4
Reputation: 18013
Just for the curious, here is a simple Ruby solution of the word count problem. It should be basically the same algorithm in C, just with a lot more code.
h = Hash.new(0)
File.read("filename.txt").split.each do |w|
h[w] += 1
end
p h
Upvotes: 5
Reputation:
WARNING untested code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
struct LLNode
{
    struct LLNode* Next;
    char* Word;
    int Count;
};
void PushWord(struct LLNode** list, const char* word)
{
    struct LLNode* node = NULL;
    if (*list == NULL)
    {
        /* dummy head node, so callers keep a stable handle */
        *list = malloc(sizeof(struct LLNode));
        (*list)->Next = NULL;
        (*list)->Word = NULL;
        (*list)->Count = 0;
    }
    for (node = *list; ; node = node->Next) // yes, we are skipping the dummy first node
    {
        if (node->Next == NULL)
        {
            /* word not seen before: append a new node */
            struct LLNode* nnode = malloc(sizeof(struct LLNode));
            nnode->Next = NULL;
            nnode->Count = 1;
            nnode->Word = malloc(strlen(word) + 1);
            strcpy(nnode->Word, word);
            node->Next = nnode;
            break;
        }
        if (!strcmp(node->Next->Word, word))
        {
            node->Next->Count++;
            break;
        }
    }
}
void GetCounts(struct LLNode* list)
{
    struct LLNode* node = list;
    if (!node)
        return;
    while ((node = node->Next) != NULL) // yes, we are skipping the dummy first node
        printf("Word: %s, Count: %d\n", node->Word, node->Count);
}
void PushWords(struct LLNode** list, const char* words)
{
    size_t len = strlen(words);
    char buff[len + 1]; // VLA sized to the input, so no buffer overruns; may consume too much stack for your application though
    size_t index = 0;
    for (size_t i = 0; i < len; i++)
    {
        char ch = words[i];
        if (index > 0 && ch == ' ')
        {
            buff[index] = '\0';
            PushWord(list, buff);
            index = 0;
        }
        else if (ch != ' ')
        {
            buff[index++] = ch;
        }
    }
    if (index > 0) // flush the final word
    {
        buff[index] = '\0';
        PushWord(list, buff);
    }
}
int main(void)
{
    struct LLNode* list = NULL;
    PushWords(&list, "Hello world this is a hello world test bla");
    GetCounts(list);
    /* release the memory here in real code */
    return 0;
}
I wrote that just now, so it probably won't work unchanged, but that is the general idea.
Another rough draft this time in C++ (note: std::map has pretty good search times):
#include <iostream>
#include <string>
#include <map>
#include <cstring>
using namespace std;
typedef map<string, int> CountMap;
void PushWords(CountMap& list, const char* words)
{
    unsigned int len = strlen(words);
    string str;
    for (unsigned int i = 0; i < len; i++)
    {
        char ch = words[i];
        if (!str.empty() && ch == ' ')
        {
            ++list[str]; // operator[] default-constructs the count to 0
            str.clear(); // reset for the next word
        }
        else if (ch != ' ')
        {
            str += ch;
        }
    }
    if (!str.empty()) // flush the final word
    {
        ++list[str];
    }
}
void PrintCount(CountMap& list)
{
    CountMap::iterator iter = list.begin(), end = list.end();
    for (; iter != end; ++iter)
    {
        cout << (*iter).first << " : " << (*iter).second << '\n';
    }
}
int main()
{
    CountMap counts;
    PushWords(counts, "Hello world this is a hello world test bla");
    PrintCount(counts);
}
Upvotes: 0
Reputation: 39083
Yes, a dictionary with word-occurrence pairs would work fine, and the usual way to implement such a dictionary would be to use a hash table (or, sometimes, a binary search tree).
You could also use a trie (or its compressed version, "Patricia trie"/Radix trie) whose complexity is asymptotically optimal for this problem, although I suspect that in practice it might be slower than a (good) hash table implementation.
[I really think whether hash tables or tries are better depends on the distribution of words in your input -- e.g. a hash table will need to store each word in its hash bucket (to guard against collisions), while if you have a lot of words with common prefixes, in a trie those common prefixes are shared and need to be stored only once each, but there is still the overhead of all the pointers... if you do happen to try both, I'm curious to know how they compare.]
Upvotes: 6
Reputation: 39883
You could use a hash table and have every entry in the hash table point to a structure containing the word and number of times it has been found so far.
Upvotes: 2