Reputation: 93
I have a map from string to unsigned in which I store each word and its frequency, of the following form:
map<string,unsigned> mapWordFrequency; //contains 1 billion such mappings
Then I read a huge file (100 GB) and retain only those words in the file whose frequency is greater than 1000. I check the frequency of the words in the file using: mapWordFrequency[word] > 1000. However, since my mapWordFrequency has 1 billion mappings and my file is huge, checking mapWordFrequency[word] > 1000 for each and every word in the file is very slow and takes more than 2 days. Can someone please suggest how I can improve the efficiency of the above code?
The map does not fit in my RAM and swapping is consuming a lot of time.
Would erasing all words with frequency < 1000, using map's erase function, help?
Upvotes: 5
Views: 1228
Reputation: 2169
You can use a hash map where your hashed string will be the key and the occurrence count will be the value. It will be faster. You can choose a good string hash based on your requirements. Here is a link to some good hashing functions:
http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
You can also use some third-party libraries for this.
EDIT: pseudo code
int mapWordFrequency[MAX_SIZE] = {0}; // if MAX_SIZE is large, allocate this dynamically instead
int someHashMethod(const std::string& input); // should return a value in [0, MAX_SIZE)

for (const std::string& currString : ListOfString)
{
    int key = someHashMethod(currString);
    ++mapWordFrequency[key];
    if (mapWordFrequency[key] > 1000)
        doSomeThing();
}
Update: As @Jens pointed out, there can be cases where someHashMethod() returns the same int (hash) for two different strings. In that case we have to resolve the collision, and then the lookup time will be more than constant. Also, as the input size is very large, creating a single array of that size may not be possible. In that case we could use distributed computing concepts, but the actual lookup time will again go up compared to a single machine.
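For reference, if the counts fit in memory, the standard library's std::unordered_map already gives you a hash table with built-in collision handling, so you don't have to write the hashing yourself. A minimal counting sketch under that assumption (the file name, the reserve size, and the 1000 threshold are just illustrative):
#include <unordered_map>
#include <string>
#include <fstream>
#include <iostream>

int main() {
    std::unordered_map<std::string, unsigned> freq; // collisions are handled internally
    freq.reserve(1u << 20);                         // pre-size the table to limit rehashing (tune as needed)

    std::ifstream in("input.txt");                  // illustrative file name
    std::string word;
    while (in >> word)
        ++freq[word];                               // insert-or-increment, amortized O(1)

    for (const auto& kv : freq)
        if (kv.second > 1000)
            std::cout << kv.first << ' ' << kv.second << '\n';
}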
Upvotes: 2
Reputation: 106096
@TonyD Can you please give a little example with trie? – Rose Sharma
Here's an example of a trie approach to this problem:
#include <iostream>
#include <string>
#include <limits>
#include <array>
#include <cstdint>

class trie
{
  public:
    void insert(const std::string& s)
    {
        node_.insert(s.c_str());
    }

    friend std::ostream& operator<<(std::ostream& os, const trie& t)
    {
        return os << t.node_;
    }

  private:
    struct Node
    {
        Node() : freq_(0) { }
        uint16_t freq_;
        std::array<Node*, 26> next_letter_{};

        void insert(const char* p) // assumes lowercase a-z input
        {
            if (*p)
            {
                Node*& p_node = next_letter_[*p - 'a'];
                if (!p_node)
                    p_node = new Node;
                p_node->insert(++p);
            }
            else
                if (freq_ < std::numeric_limits<decltype(freq_)>::max()) ++freq_;
        }
    } node_;

    friend std::ostream& operator<<(std::ostream& os, const Node& n)
    {
        os << '(';
        if (n.freq_) os << n.freq_ << ' ';
        for (size_t i = 0; i < 26; ++i)
            if (n.next_letter_[i])
                os << char('a' + i) << *(n.next_letter_[i]);
        return os << ')';
    }
};

int main()
{
    trie my_trie;
    my_trie.insert("abc");
    my_trie.insert("abcd");
    my_trie.insert("abc");
    my_trie.insert("bc");
    std::cout << my_trie << '\n';
}
Output:
(a(b(c(2 d(1 ))))b(c(1 )))
The output is a compressed/tree-like representation of your word-frequency histogram: abc appears 2 times, abcd 1, bc 1. The parentheses can be thought of as pushing and popping characters from a "stack" to form the current prefix or - when there's a number - a word.
Whether it improves much on a map depends on the variations in the input words, but it's worth a try. A more memory-efficient implementation might use a vector or set - or even a string of, say, space-separated suffixes - when there are few elements beneath the current prefix, then switch to the array-of-26-pointers when that's likely to need less memory.
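To make that concrete, here is a hypothetical sketch of one simple variant of that idea - not the code above, and it keeps sparse children as a small sorted vector of (letter, child) pairs rather than storing whole suffixes - switching to the array-of-26-pointers once a prefix has enough distinct next letters (the threshold of 8 is arbitrary):
#include <algorithm>
#include <array>
#include <cstdint>
#include <iostream>
#include <memory>
#include <utility>
#include <vector>

struct HybridNode
{
    uint16_t freq_ = 0;
    std::vector<std::pair<char, HybridNode*>> sparse_;    // used while this prefix is sparse
    std::unique_ptr<std::array<HybridNode*, 26>> dense_;  // used once it becomes dense

    // Return a reference to the child slot for letter c, converting the
    // sparse vector into the 26-pointer array when it grows past the threshold.
    HybridNode*& slot(char c)
    {
        if (dense_) return (*dense_)[c - 'a'];
        auto it = std::lower_bound(sparse_.begin(), sparse_.end(), c,
            [](const std::pair<char, HybridNode*>& p, char ch) { return p.first < ch; });
        if (it == sparse_.end() || it->first != c)
            it = sparse_.insert(it, std::pair<char, HybridNode*>(c, nullptr));
        if (sparse_.size() <= 8)
            return it->second;
        // Too many children under this prefix: convert to the fixed array.
        dense_ = std::make_unique<std::array<HybridNode*, 26>>();
        dense_->fill(nullptr);
        for (auto& p : sparse_) (*dense_)[p.first - 'a'] = p.second;
        sparse_.clear();
        return (*dense_)[c - 'a'];
    }

    void insert(const char* p) // assumes lowercase a-z, as in the trie above
    {
        if (!*p) { ++freq_; return; }
        HybridNode*& child = slot(*p);
        if (!child) child = new HybridNode;
        child->insert(p + 1);
    }
};

int main()
{
    HybridNode root;
    root.insert("abc");
    root.insert("abc");
    root.insert("abcd");
    std::cout << "inserted\n"; // frequencies live in the nodes, as in the trie above
}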
Upvotes: 1
Reputation: 9991
The main problem seems to be the memory footprint, so we are looking for a solution that uses little memory. A way to save memory is to use sorted vectors instead of a map. Now, a vector has a lookup time of ~log(n) comparisons and an average insert cost of n/2 element moves, which is bad. The upside is that you have basically no memory overhead, the memory to be moved is small due to the separation of data, and you get sequential memory (cache-friendliness), which can easily outperform a map. The required memory would be 2 (word count) + 4 (index) + 1 ('\0' character) + x (length of word) bytes per word. To achieve that we need to get rid of std::string, because it is just too big in this case.
You can split your map into a vector<char> that saves the strings one after another, separated by '\0' characters, a vector<unsigned int> for the index, and a vector<short int> for the word count. The code would look something like this (tested):
#include <vector>
#include <algorithm>
#include <cstring>
#include <string>
#include <fstream>
#include <iostream>

std::vector<char> strings;
std::vector<unsigned int> indexes;
std::vector<short int> wordcount;
const int countlimit = 1000;

void insertWord(const std::string &str) {
    //find the word
    auto stringfinder = [](unsigned int lhs, const std::string &rhs) {
        return &strings[lhs] < rhs;
    };
    auto index = lower_bound(begin(indexes), end(indexes), str, stringfinder);
    //increment counter
    if (index == end(indexes) || strcmp(&strings[*index], str.c_str())) { //unknown word
        wordcount.insert(begin(wordcount) + (index - begin(indexes)), 1);
        indexes.insert(index, strings.size());
        strings.insert(end(strings), str.c_str(), str.c_str() + str.size() + 1);
    }
    else { //known word
        auto &count = wordcount[index - begin(indexes)];
        if (count <= countlimit) //prevent overflow, but let the count pass the limit once so the final check can fire
            count++;
    }
}

int main() {
    std::ifstream f("input.txt");
    std::string s;
    while (f >> s) { //not a good way to read in words
        insertWord(s);
    }
    for (size_t i = 0; i < indexes.size(); ++i) {
        if (wordcount[i] > countlimit) {
            std::cout << &strings[indexes[i]] << ": " << wordcount[i] << '\n';
        }
    }
}
This approach still saves all words in memory. According to Wolfram Alpha the average word length in the English language is 5.1 characters. This gives you a total memory requirement of (5.1 + 7) * 1bn bytes = 12.1bn bytes = 12.1GB. Assuming you have a halfway modern computer with 16+GB of RAM you can fit it all into RAM.
If this fails (because you don't have English words and they don't fit in memory), the next approach would be memory-mapped files. That way you can make indexes point into the memory-mapped file instead of strings, so you can get rid of strings, but the access time would suffer.
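A minimal POSIX sketch of that memory-mapping step, assuming the '\0'-separated string data has already been written out to a file (the name strings.bin and the minimal error handling are illustrative only):
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a pre-built file of '\0'-separated words read-only, so that `indexes`
// can hold offsets into the mapping instead of into an in-RAM vector<char>.
const char* map_strings(const char* path, std::size_t& length_out)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) return nullptr;
    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return nullptr; }
    length_out = static_cast<std::size_t>(st.st_size);
    void* p = mmap(nullptr, length_out, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after the descriptor is closed
    return p == MAP_FAILED ? nullptr : static_cast<const char*>(p);
}
// Usage: std::size_t len; const char* base = map_strings("strings.bin", len);
// then &base[indexes[i]] plays the role of &strings[indexes[i]] above.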
If this fails due to low performance you should look into map-reduce which is very easy to apply to this case. It gives you as much performance as you have computers.
Upvotes: 1
Reputation: 4357
You need another approach to your problem: your data is too big to be processed all at once. For example, you could split your file into multiple files; the easiest would be to split them logically by first letter.
100 GB / 26 letters ≈ 3.85 GB
Now you'll have 26 files of roughly 3.85 GB each.
You know that the words in any one of these files cannot appear in any other file; this helps, since you won't have to merge the results.
With files of roughly 4 GB, it now becomes much easier to work in RAM.
std::map has a problem when you start using a lot of memory, since it fragments a lot. Try std::unordered_map, and if that's still not performing well, you may be able to load the file into memory and sort it. Counting occurrences will then be quite easy.
Assuming you have a lot of duplicates, your map or unordered_map will have a significantly lower memory footprint.
Run your code in a loop over the files, appending the results to another file (see the sketch below). You should be done quite quickly.
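A minimal sketch of the partitioning step, assuming whitespace-separated words with mostly lowercase a-z first letters; the bucket file names are made up for illustration:
#include <cctype>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::ifstream in("input.txt"); // the 100 GB word file
    std::vector<std::ofstream> buckets;
    for (char c = 'a'; c <= 'z'; ++c)
        buckets.emplace_back(std::string("bucket_") + c + ".txt");
    std::ofstream other("bucket_other.txt"); // anything not starting with a-z

    std::string word;
    while (in >> word) {
        char first = static_cast<char>(std::tolower(static_cast<unsigned char>(word[0])));
        if (first >= 'a' && first <= 'z')
            buckets[first - 'a'] << word << '\n'; // each bucket can later be counted on its own
        else
            other << word << '\n';
    }
}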
Upvotes: 1
Reputation: 8441
Depending on the statistical distribution of your words, it may be worth compressing each word before adding it to the map. As long as this is lossless compression, you can recover the original words after filtering. The idea is that you may be able to reduce the average word size (hence saving memory and time spent comparing keys). Here is a simple compression/decompression procedure you could use:
#include <string>
#include <sstream>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <boost/iostreams/copy.hpp>
inline std::string compress(const std::string& data)
{
    std::stringstream decompressed {data};
    boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
    stream.push(boost::iostreams::zlib_compressor());
    stream.push(decompressed);
    std::stringstream compressed {};
    boost::iostreams::copy(stream, compressed);
    return compressed.str();
}

inline std::string decompress(const std::string& data)
{
    std::stringstream compressed {data};
    boost::iostreams::filtering_streambuf<boost::iostreams::input> stream;
    stream.push(boost::iostreams::zlib_decompressor());
    stream.push(compressed);
    std::stringstream decompressed;
    boost::iostreams::copy(stream, decompressed);
    return decompressed.str();
}
In addition to using a std::unordered_map as others have suggested, you could also move any words that have already been seen more than 1000 times out of the map and into a std::unordered_set. This would require also checking the set before the map, but you may see better hash performance by doing this. It may also be worth rehashing occasionally if you employ this strategy.
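A rough sketch of that promotion idea (the names, and wiring it into the file-reading loop, are assumptions for illustration; the threshold mirrors the question's 1000):
#include <string>
#include <unordered_map>
#include <unordered_set>

std::unordered_map<std::string, unsigned> counts; // words still being counted
std::unordered_set<std::string> frequent;         // words already past the threshold

void addWord(const std::string& word)
{
    if (frequent.count(word))   // already known to be frequent: cheap set lookup, map untouched
        return;
    unsigned& c = ++counts[word];
    if (c > 1000) {             // promote the word and shrink the map
        frequent.insert(word);
        counts.erase(word);
    }
}
// After the whole file is processed, `frequent` holds exactly the words seen more than 1000 times.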
Upvotes: 1
Reputation: 4343
I suggest you use an unordered_map as opposed to a map. As already discussed in the comments, the former will give you an insertion/retrieval time of O(1) as opposed to O(log n) in a map.
As you have already said, memory swapping is consuming a lot of time. So how about tackling the problem incrementally: load as much data into an unordered_map as fits in memory, save the partial counts, and continue with the next chunk. After one pass you should have a lot of unordered_maps, and you can start to combine them in subsequent passes.
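A hedged sketch of the combining step, assuming each pass left behind an in-memory std::unordered_map of partial counts (loading them back from disk is omitted, and the names are illustrative):
#include <string>
#include <unordered_map>
#include <vector>

// Sum several partial word counts into one map; a final scan over the result
// can then pick out the words whose total exceeds 1000.
std::unordered_map<std::string, unsigned long long>
mergeCounts(const std::vector<std::unordered_map<std::string, unsigned long long>>& parts)
{
    std::unordered_map<std::string, unsigned long long> total;
    for (const auto& part : parts)
        for (const auto& kv : part)
            total[kv.first] += kv.second; // missing keys start at 0
    return total;
}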
You could improve the speed by doing this in a distributed manner: process the pieces of data on different computers and then combine the data (which will be in the form of unordered maps). However, I have no prior experience in distributed computing, so I can't help beyond this.
Also, if implementing something like this is too cumbersome, I suggest you use an external mergesort. It is a method of sorting a file too large to fit into memory by sorting smaller chunks and combining them. The reason I'm suggesting this is that external mergesort is a pretty common technique, and you might find already-implemented solutions for your needs. Even though the time complexity of sorting is higher than your idea of using a map, it will reduce the swapping overhead compared to a map. As pointed out in the comments, sort on Linux implements an external mergesort.
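Once the words are sorted (by an external mergesort or otherwise), counting needs only the previous word in memory. A minimal sketch that reads sorted words and prints those occurring more than 1000 times (the file name is illustrative):
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream in("sorted_words.txt"); // assumed output of the external sort
    std::string word, prev;
    unsigned long long count = 0;
    while (in >> word) {
        if (word == prev) {
            ++count;              // still inside the same run of identical words
        } else {
            if (count > 1000) std::cout << prev << ' ' << count << '\n';
            prev = word;
            count = 1;            // start a new run
        }
    }
    if (count > 1000) std::cout << prev << ' ' << count << '\n'; // flush the last run
}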
Upvotes: 4