user123
user123

Reputation: 5407

Creating hashtable in C++ for string manipulation

I am trying to process content of article or any paragraph [each string]. First I will convert into words using strtok().

After that I want store each word in hashtable (because I think it's only best way to process big data). While dealing with each word, I want to store occurrence of each words. And at the end I want to get words which occurring max time.

unordered_map stores elements with key values and allows fast retrieval of elements with key. This may be useful to me.

I am not good with C++, so want some opinions.

  1. Storing entire contains in char *ch ="content of article" is good way to proceed or string::str? I am familiar with first one only. for 2nd I feel complex during working with functions.

  2. Storing entire content(strings) into unordered_map(), then How can I create hashtable which contains element as words, and it's occurrence with it. And then Can I get words with max occurance?

  3. Is there any other C++ function which can help me out to do what I want.

Upvotes: 0

Views: 3844

Answers (5)

fredoverflow
fredoverflow

Reputation: 263320

I want store each word in hashtable (because I think it's only best way to process big data). While dealing with each word, I want to store occurrence of each words.

Here is some pseudo C++ to get you started:

std::unordered_map<std::string, int> occurrences;
while (more_words_available)
{
    std::string word = fetch_next_word();
    ++occurrences[word];
}

how I print occurrence count values for each word in while?

Do you have a C++11 compiler? Then use the new foreach loop:

for (auto p : occurrences)
{
    std::cout << p.first << " occurred " << p.second << " times.\n";
}

Otherwise, use the traditional for loop with iterators:

for (std::unordered_map<std::string, int>::iterator it = occurrences.begin();
                                                    it != occurrences.end();
                                                    ++it)
{
    std::cout << it->first << " occurred " << it->second << " times.\n";
}

Upvotes: 2

cpp
cpp

Reputation: 3801

If your article is in file test.txt then you may create your map like that:

#include<fstream>
#include<map>
#include<string>

using namespace std;
int main()
{
    ifstream in_file("test.txt");
    map<string,int> words;

    string tword;
    while(in_file >> tword)  //line 12
        words[tword]++;
}

You can also store entire content in istringstream ss and use it instead of in_file above:

while(ss >> twords)  //line 12

Upvotes: 1

James Kanze
James Kanze

Reputation: 154027

You don't need (nor want) strtok. If white space is the separator for words, just reading into a string using >> will do the trick; the entire input phase would be:

std::unordered_map<std::string, int> counts;
std::string word;
while ( source >> word ) {
    ++ counts[word];
}

Depending on the requirements, you might want to do things like converting the word to lower case before counting it, or stripping final punctuation from it (so that word, Word and Word. are all the same).

For accessing sorted by count, the simplest would be to copy the contents of the map into an std::vector<std::pair<std::string, int>> and sorting that. (Don't forget that you can construct a vector from two iterators. So this is just two more lines.)

Upvotes: 1

Patrik Beck
Patrik Beck

Reputation: 2505

It seems the data structure you need will need to do several operations: looking up by key (word) and string values (count) for each key. But you also want to be able to print the frequent works, in which case you needs sorting by value.

None of the standard containers handle this out of the box. Since the first operation will be happening frequently, and the second just once, you should select container that suites best for the first operation.

Both std::map and std::unordered_map would do well.

Try following:

std::map<std::string, int>

or

unordered_map std::map<std::string, int>

For printing all the works in order of frequency, you will have to copy it into another structure and then sort it. Or it in in one operation. You can copy everything into

std::map<int, std::string>

and then just print it.

Upvotes: 3

aamir
aamir

Reputation: 151

  1. working with string is always easier
  2. Words can be used as keys and count as value. Retrieval based on key is fast from a unordered_map. Getting words with max count will require iterating over the whole map. Your problem is that you need 2 indices.
  3. Consider using Boost::MultiIndex for creating 2 indices in a container.

Upvotes: 1

Related Questions