Reputation: 107

Count the number of unique words (case does not matter for this count)

Hey, so I'm having trouble figuring out the code to count the number of unique words. My thought process, in terms of pseudocode, was to first make a vector, something like vector<string> unique_word_list; and then have the program read each line with something like while(getline(fin,line)). The hard part for me is writing the code that checks the vector (array) to see whether the string is already in there. If it's in there, I just increase the word count (simple enough), but if it's not in there, I add a new element to the vector. I would really appreciate it if someone could help me out here. I feel like this is not hard, but for some reason I can't think of the code for comparing the string with what's inside the array and determining whether it's a unique word or not.

Upvotes: 1

Answers (2)

vsoftco

Reputation: 56557

I can't help writing an answer that makes use of C++'s beautiful standard library. I'd do it like this, with a std::set:

#include <algorithm>
#include <cctype>
#include <string>
#include <set>
#include <fstream>
#include <iterator>
#include <iostream>

int main()
{
    std::ifstream ifile("test.txt");
    std::istream_iterator<std::string> it{ifile};
    std::set<std::string> uniques;
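    std::transform(it, {}, std::inserter(uniques, uniques.begin()),
        [](std::string str) // make it lower case, so case doesn't matter anymore
        {
            // go through unsigned char: passing a negative char to tolower is undefined
            std::transform(str.begin(), str.end(), str.begin(),
                [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
            return str;
        });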
    // display the unique elements
    for(auto&& elem: uniques)
        std::cout << elem << " ";

    // display the size:
    std::cout << std::endl << uniques.size();
}

You can also define a new string type whose char_traits make the comparison case-insensitive. This is the code you'd need (much lengthier than before, but you may end up reusing it); the ci_char_traits definition is taken from cppreference.com:

#include <algorithm>
#include <cctype>
#include <string>
#include <set>
#include <fstream>
#include <iterator>
#include <iostream>

struct ci_char_traits : public std::char_traits<char> {
    // go through unsigned char: calling toupper on a negative char is undefined
    static int upper(char c) { return std::toupper(static_cast<unsigned char>(c)); }
    static bool eq(char c1, char c2) { return upper(c1) == upper(c2); }
    static bool ne(char c1, char c2) { return upper(c1) != upper(c2); }
    static bool lt(char c1, char c2) { return upper(c1) <  upper(c2); }
    static int compare(const char* s1, const char* s2, std::size_t n) {
        while ( n-- != 0 ) {
            if ( upper(*s1) < upper(*s2) ) return -1;
            if ( upper(*s1) > upper(*s2) ) return 1;
            ++s1; ++s2;
        }
        return 0;
    }
    // returns nullptr when the character is not found, as char_traits::find requires
    static const char* find(const char* s, std::size_t n, char a) {
        for ( ; n-- != 0; ++s ) {
            if ( upper(*s) == upper(a) ) return s;
        }
        return nullptr;
    }
};

using ci_string = std::basic_string<char, ci_char_traits>;

// need to overload the insertion and extraction operators,
// otherwise we cannot use them with our new type
std::ostream& operator<<(std::ostream& os, const ci_string& str) {
    return os.write(str.data(), str.size());
}

std::istream& operator>>(std::istream& is, ci_string& str) {
    std::string tmp;
    is >> tmp;
    str.assign(tmp.data(), tmp.size());
    return is;
}

int main()
{
    std::ifstream ifile("test.txt");
    std::istream_iterator<ci_string> it{ifile};
    std::set<ci_string> uniques(it, {}); // that's it

    // display the unique elements
    for (auto && elem : uniques)
        std::cout << elem << " ";

    // display the size:
    std::cout << std::endl << uniques.size();
}

Upvotes: 3

Barry

Reputation: 303047

Don't use a vector. Use a container that maintains uniqueness for you, like std::set or std::unordered_set. Just convert the string to lower case (using std::tolower) before you add it:

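std::set<std::string> words;
std::string next;
while (file >> next) {  // file is an already-open std::ifstream
    // std::tolower is overloaded (<cctype> and <locale>), so passing it by name
    // may not compile; a lambda avoids that and the negative-char pitfall
    std::transform(next.begin(), next.end(), next.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    words.insert(next);
}

std::cout << "We have " << words.size() << " unique words.\n";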

Upvotes: 6
