Reputation: 23
Hi, I am writing a program that counts the number of times each word occurs in a file, then prints a list of the words with counts between 800 and 1000, sorted by count. I am stuck on keeping a counter that checks whether the current word matches the next one until a new word appears. In main I try to open the file, read it word by word, and call sort inside the while loop to sort the vector. Then, in the for loop, I go through all the words and do count++ if the first word equals the second. I don't think that is how you keep a counter.
Here is the code:
#include <string>
#include <iostream>
#include <fstream>
#include <vector>
#include <algorithm>
#include <set>
using namespace std;
vector<string> lines;
vector<int> second;
set<string> words;
multiset<string> multiwords;
void readLines(const char *filename)
{
    string line;
    ifstream infile;
    infile.open(filename);
    if (!infile)
    {
        cerr << filename << " cannot open" << endl;
        return;
    }
    getline(infile, line);
    while (!infile.eof())
    {
        lines.push_back(line);
        getline(infile, line);
    }
    infile.close();
}
int binary_search(vector<string> &v, int size, int value)
{
    int from = 0;
    int to = size - 1;
    while (from <= to)
    {
        int mid = (from + to) / 2;
        int mid_count = multiwords.count(v[mid]);
        if (value == mid_count)
            return mid;
        if (value < mid_count) to = mid - 1;
        else from = mid + 1;
    }
    return from;
}
int main()
{
    vector<string> words;
    string x;
    ifstream inFile;
    int count = 0;
    inFile.open("bible.txt");
    if (!inFile)
    {
        cout << "Unable to open file";
        exit(1);
    }
    while (inFile >> x){
        sort(words.begin(), words.end());
    }
    for(int i = 0;i < second.size();i++)
    {
        if(x == x+1)
        {
            count++;
        }
        else
            return;
    }
    inFile.close();
}
Upvotes: 2
Views: 11249
Reputation: 15275
With the more modern features available in C++20, we can now give an improved answer. By using the standard library containers, we can achieve the whole task with just a few statements.
There is a nearly universal solution approach for "counting": use a std::unordered_map (described in the C++ reference). It is the std::unordered_map's very convenient index operator [] which makes counting simple. This operator returns a reference to the value that is mapped to a key: it searches for the key and returns the associated value, and if the key does not exist, it inserts a new key/value pair and returns a reference to the freshly inserted value. So, either way, a reference to the value is returned, and that reference can be incremented.
Example: with a "std::unordered_map<char, int> mymap{}" and the text "aba", mymap['a']++ will insert a new key/value pair 'a'/0 and then increment it, resulting in 'a'/1, and so on.
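A minimal, self-contained sketch of exactly this behaviour (the surrounding main and the printout are only for illustration):
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<char, int> mymap{};
    for (const char c : std::string{ "aba" })
        mymap[c]++;                                        // a missing key starts at 0, then gets incremented
    std::cout << mymap['a'] << ' ' << mymap['b'] << '\n';  // prints: 2 1
}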
Of course we will use a std::string as the key in our example.
Sorting is similarly simple. We just insert everything into a std::multimap, with key and value exchanged. Then everything is sorted according to the frequency. With the very convenient lower_bound and upper_bound functions of the std::multimap we can find and display the requested values easily.
All this results in a compact code example:
#include <iostream>
#include <fstream>
#include <string>
#include <unordered_map>
#include <map>
#include <iterator>
#include <ranges>
#include <iomanip>
namespace rng = std::ranges; // Abbreviation for the ranges namespace
int main() {

    // Open bible text file and check, if it could be opened
    if (std::ifstream ifs{ "r:\\bible.txt" }; ifs) {

        // Read all words from file and count them
        std::unordered_map<std::string, std::size_t> counter{};
        for (const auto& word : rng::istream_view<std::string>(ifs)) counter[word]++;

        // Sort the words according to their frequency
        std::multimap<std::size_t, std::string> sorter{};
        for (const auto& [word, count] : counter) sorter.insert({ count, word });

        // Show words with frequency given in a certain range
        for (const auto& [count, word] : rng::subrange{ sorter.lower_bound(800), sorter.upper_bound(1000) })
            std::cout << std::setw(25) << word << " --> " << count << '\n';
    }
    else std::cerr << "\n*** Error: Could not open source text file\n";
}
Upvotes: 0
Reputation: 394054
Just for fun, I did a solution in c++0x style, using Boost MultiIndex.
This style would be quite clumsy without the auto keyword (type inference).
By maintaining the indexes by word and by frequency at all times, there is no need to remove, partition, nor sort the wordlist: it'll all be there.
To compile and run:
g++ --std=c++0x -O3 test.cpp -o test
curl ftp://ftp.funet.fi/pub/doc/bible/texts/english/av.tar.gz |
tar xzO | sed 's/^[ 0-9:]\+//' > bible.txt
time ./test
#include <boost/foreach.hpp>
#include <boost/lambda/lambda.hpp>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/member.hpp>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;
struct entry
{
    string word;
    size_t freq;

    void increment() { freq++; }
};

struct byword {}; // TAG
struct byfreq {}; // TAG

int main()
{
    using ::boost::lambda::_1;
    using namespace ::boost::multi_index;

    multi_index_container<entry, indexed_by< // sequenced<>,
        ordered_unique <tag<byword>, member<entry,string,&entry::word> >,   // alphabetically
        ordered_non_unique<tag<byfreq>, member<entry,size_t,&entry::freq> > // by frequency
    > > tally;

    ifstream inFile("bible.txt");
    string s;
    while (inFile>>s)
    {
        auto& lookup = tally.get<byword>();
        auto it = lookup.find(s);
        if (lookup.end() != it)
            lookup.modify(it, boost::bind(&entry::increment, _1));
        else
            lookup.insert({s, 1});
    }

    BOOST_FOREACH(auto e, tally.get<byfreq>().range(800 <= _1, _1 <= 1000))
        cout << e.freq << "\t" << e.word << endl;
}
Note how a dedicated entry type is used instead of std::pair (for obvious reasons). Compared to my earlier code, this is slower: it maintains the index by frequency during the insertion phase. That is unnecessary while inserting, but it makes for much more efficient extraction of the [800,1000] range:
tally.get<byfreq>().range(800 <= _1, _1 <= 1000)
The multi-set of frequencies is already ordered. So the actual speed/memory trade-off might tip in favour of this version, especially when documents are large and contain very few duplicated words (alas, this is a property known not to hold for the corpus text of the bible, unless someone translates it into neologorrhea).
Upvotes: 0
Reputation: 394054
Heh. I know bluntly showing a solution is not really helping you. However, I glanced through your code and saw many unused and confusing bits. Here's what I'd do:
#include <algorithm>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <vector>
using namespace std;
// types
typedef std::pair<string, size_t> frequency_t;
typedef std::vector<frequency_t> words_t;
// predicates
static bool byDescendingFrequency(const frequency_t& a, const frequency_t& b)
    { return a.second > b.second; }

const struct isGTE // greater than or equal
{
    size_t inclusive_threshold;

    bool operator()(const frequency_t& record) const
        { return record.second >= inclusive_threshold; }
} over1000 = { 1001 }, over800 = { 800 };

int main()
{
    words_t words;
    {
        map<string, size_t> tally;

        ifstream inFile("bible.txt");
        string s;
        while (inFile >> s)
            tally[s]++;

        remove_copy_if(tally.begin(), tally.end(),
                       back_inserter(words), over1000);
    }

    words_t::iterator begin = words.begin(),
                      end = partition(begin, words.end(), over800);

    std::sort(begin, end, &byDescendingFrequency);

    for (words_t::const_iterator it = begin; it != end; it++)
        cout << it->second << "\t" << it->first << endl;
}
993 because
981 men
967 day
954 over
953 God,
910 she
895 among
894 these
886 did
873 put
868 thine
864 hand
853 great
847 sons
846 brought
845 down
819 you,
811 so
995 tuum
993 filius
993 nec
966 suum
949 meum
930 sum
919 suis
907 contra
902 dicens
879 tui
872 quid
865 Domine
863 Hierusalem
859 suam
839 suo
835 ipse
825 omnis
811 erant
802 se
Performance is about 1.12s for both files, but only 0.355s after drop-in replacing map<> with boost::unordered_map<>.
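For reference, a minimal sketch of that drop-in replacement (assuming Boost is available; only the header and the container type change, the counting loop stays the same):
#include <boost/unordered_map.hpp>
#include <fstream>
#include <string>

int main()
{
    // Identical counting loop; only the container type differs from the std::map version above
    boost::unordered_map<std::string, size_t> tally;

    std::ifstream inFile("bible.txt");
    std::string s;
    while (inFile >> s)
        tally[s]++;
}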
Upvotes: 3
Reputation: 361812
One solution could be this: define a letter_only locale so as to ignore punctuation coming from the stream and to read only valid "english" letters from the input stream. That way, the stream will treat the words "ways", "ways." and "ways!" as just the same word "ways", because the stream will ignore punctuation like "." and "!".
#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <iterator>
#include <locale>
#include <map>
#include <string>
#include <vector>

struct letter_only: std::ctype<char>
{
    letter_only(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        // treat 'A'..'z' as letters; every other character acts as whitespace
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size, std::ctype_base::space);
        std::fill(&rc['A'], &rc['z'+1], std::ctype_base::alpha);
        return &rc[0];
    }
};
And then use it as:
int main()
{
    std::map<std::string, int> wordCount;

    std::ifstream input;
    // enable reading only english letters!
    input.imbue(std::locale(std::locale(), new letter_only()));
    input.open("filename.txt");

    std::string word;
    while (input >> word)
    {
        std::string uppercase_word;   // built fresh for every word, so words do not accumulate
        std::transform(word.begin(),
                       word.end(),
                       std::back_inserter(uppercase_word),
                       (int(&)(int))std::toupper); // the cast is needed!
        ++wordCount[uppercase_word];
    }

    for (std::map<std::string, int>::iterator it = wordCount.begin();
         it != wordCount.end();
         ++it)
    {
        std::cout << "word = " << it->first
                  << " : count = " << it->second << std::endl;
    }
}
Upvotes: 2
Reputation: 208476
A more efficient approach uses a single map<string, int> of occurrences: read the words one by one and increment the counter in m[word]. After all words have been accounted for, iterate over the map and, for the words whose count lies in the given range, add them to a multimap<int, string>. Finally, dump the contents of the multimap, which will be ordered by number of occurrences and alphabetically within equal counts.
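A minimal sketch of that approach (the file name and the [800, 1000] bounds are taken from the question; the variable names are only illustrative):
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <utility>

int main()
{
    // Pass 1: count occurrences of every word
    std::map<std::string, int> m;
    std::ifstream in("bible.txt");
    std::string word;
    while (in >> word)
        ++m[word];

    // Pass 2: re-index the words in the requested range by their count
    std::multimap<int, std::string> byCount;
    for (std::map<std::string, int>::const_iterator it = m.begin(); it != m.end(); ++it)
        if (it->second >= 800 && it->second <= 1000)
            byCount.insert(std::make_pair(it->second, it->first));

    // The multimap is already ordered by count (and alphabetically within equal counts)
    for (std::multimap<int, std::string>::const_iterator it = byCount.begin(); it != byCount.end(); ++it)
        std::cout << it->second << " : " << it->first << '\n';
}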
Upvotes: 2