Reputation: 468
I am dealing with incoming stream of text. For example USA, UK, China, Russia, USA, UK, China, France, Germany.
I would need to break them up into sequence of 3 words (or maybe n words) and analyze which sequence has the highest frequency. On the above case, the sequence USA, UK, China occurs twice. So it has the highest frequency.
In addition, I would need to index the frequencies of all sequence. I have tried using C++ stl map to partially solve some of the problem, but I do not see the solution as elegant. The reason is to uniquely index m numbers of unique words, in a 3 words sequence using stl map, the mathematics is as below,
i x m x m + j x m + k
i, j, k being the integer map to each word.
The problem with the above solution is in a continuous stream of text, we are agnostic of the total number of unique words, or m. Can anyone suggest a better algorithm?
Upvotes: 1
Views: 670
Reputation: 468
map<vector<unsigned int>, unsigned int> sequenceFrequency;
vector<unsigned int> codedWord;
void MapSequenceFrequency(unsigned int key0, unsigned int key1, unsigned int key2)
{
codedWord[0] = key0;
codedWord[1] = key1;
codedWord[2] = key2;
map<vector<unsigned int>, unsigned int>::iterator it;
if (sequenceFrequency.find(codedWord) == sequenceFrequency.end())
sequenceFrequency[codedWord] = 0;
else
sequenceFrequency[codedWord]++;
}
Upvotes: 0
Reputation: 54
Another option is to use an std::string
as the key of the map.
Each key could be the concatenation of the 3 words. This way, you would define each triple uniquely with no need of knowing m
.
However, you'll have to implement an order operator for 2 strings and pass it as the third parameter on the declaration of the map, as discussed in this thread: std::string as a key in std::map using a compare operator.
Hope it helps!
Upvotes: 0
Reputation: 19601
I think you would be better using some sort of map or hash table of triples, because then you store only triples that actually occur, whereas with an array you make space for all possible triples. If you see n words, they might all be different, in which case you store about n triples - but an array for all triples of n different words would be of size n^3.
As a curiosity, there are bijective maps from pairs of non-negative integers to non-negative integers. One such is (a,b)->(a+b)(a+b+1)/2 + b which maps (0, 0) (0, 1) (1, 0) (0, 2) (1, 1) (2,1) ... to 0, 1, 2, 3, 4, 5, .. - think of it as numbering the pairs by writing them out in a square and then numbering down diagonals. You could use this twice to map triples of numbers to a single number: (a, b, c) -> ((a, b), c). However it isn't really very practical.
Upvotes: 2