Reputation: 1044
I'm working on a piece of code that is meant to be a fast "search engine". I have entries in a file that need to be searchable after reading the whole file in. They need to be searchable both by the name of the entry and by its offset from the beginning of the file. My problem is memory usage, since there are millions of entries. Currently I am using two separate std::maps to store the data so that either search term can be specified. This leads to double storage of the data, which is what I am trying to reduce.
I've used valgrind massif to find that a major part of the memory usage is the double storage of entries.
Current method of storage:
struct entry {
    std::string name;
    uint16_t offset;
    uint16_t size;
    bool isConst;
};
nameSearchMap.insert(std::pair<std::string, entry>(s_entry.name, s_entry));
offsetSearchMap.insert(std::pair<uint16_t, entry>(s_entry.offset, s_entry));
Is there a way that I can make a single map that is searchable by either type of key?
Upvotes: 2
Views: 3325
Reputation: 5658
You can use Boost.MultiIndex as follows:
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/member.hpp>
#include <cstdint>
#include <string>

struct entry {
    std::string name;
    uint16_t offset;
    uint16_t size;
    bool isConst;
};

using namespace boost::multi_index;

using entry_map = multi_index_container<
    entry,
    indexed_by<
        hashed_unique<tag<struct by_name>, member<entry, std::string, &entry::name>>,
        hashed_unique<tag<struct by_offset>, member<entry, uint16_t, &entry::offset>>
    >
>;

#include <iostream>

int main()
{
    entry_map m = {{"hello", 0, 10, false}, {"hi", 1, 20, true}, {"bye", 2, 15, false}};
    auto& name_map = m.get<by_name>();
    auto& offset_map = m.get<by_offset>();
    std::cout << "size(\"hello\"): " << name_map.find("hello")->size << "\n";
    std::cout << "size(2): " << offset_map.find(2)->size << "\n";
}
Output:
size("hello"): 10
size(2): 15
Upvotes: 0
Reputation: 8220
You can save three pointers per entry and avoid duplicating each entry, but this method is only worthwhile if insertions are much less frequent than lookups:
std::vector<std::string> names;
std::vector<entry> entries;
std::map<uint16_t, size_t> offset2pos;
names and entries must be sorted identically, so that the position of a name matches the position of the entry with that name:
names[pos] == entries[pos].name
and the map offset2pos converts an offset to a position:
size_t pos = offset2pos[offset];
To convert a name to a position, use binary search (std::lower_bound):
size_t pos = std::lower_bound(names.begin(), names.end(), name) - names.begin();
This technique is good for cases where insertions are much rarer than lookups. The bookkeeping is not trivial, though: you need to update (possibly rebuild) offset2pos, and keep the vectors sorted, each time you insert one or more elements.
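A minimal sketch of this layout, assuming everything is sorted once up front after loading the file; the helper names build_index, find_by_name and find_by_offset are illustrative, not from the answer:

#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct entry {
    std::string name;
    uint16_t offset;
    uint16_t size;
    bool isConst;
};

std::vector<std::string> names;              // sorted by name
std::vector<entry> entries;                  // sorted the same way as names
std::map<uint16_t, std::size_t> offset2pos;  // offset -> position

// Sort the entries once after loading, then derive the parallel
// name vector and the offset-to-position map.
void build_index(std::vector<entry> loaded) {
    std::sort(loaded.begin(), loaded.end(),
              [](const entry& a, const entry& b) { return a.name < b.name; });
    entries = std::move(loaded);
    names.clear();
    offset2pos.clear();
    for (std::size_t pos = 0; pos < entries.size(); ++pos) {
        names.push_back(entries[pos].name);
        offset2pos[entries[pos].offset] = pos;
    }
}

const entry* find_by_name(const std::string& name) {
    auto it = std::lower_bound(names.begin(), names.end(), name);
    if (it == names.end() || *it != name) return nullptr;
    return &entries[it - names.begin()];
}

const entry* find_by_offset(uint16_t offset) {
    auto it = offset2pos.find(offset);
    return it == offset2pos.end() ? nullptr : &entries[it->second];
}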
The answer that uses shared_ptr stores, per element:
std::string, uint16_t, 2 * Sptr, 4 * Mptr
My approach stores, per element:
std::string, uint16_t, Sptr, 2 * Mptr
where Sptr is the internal pointer of a smart-pointer object and Mptr is a left/right pointer of a map's binary tree node.
So the memory saving is three pointers per entry.
Upvotes: 0
Reputation: 206557
They need to be searchable both by the name of the entry and by its offset from the beginning of the file.
This indicates to me that you can store the data in an array/vector and use maps that map to indices in the array/vector.
std::vector<entry> entries;
std::map<std::string, size_t> map1;
std::map<uint16_t, size_t> map2;
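A short usage sketch built on those declarations; the sample data and the fill loop are illustrative, not from the answer:

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct entry {
    std::string name;
    uint16_t offset;
    uint16_t size;
    bool isConst;
};

int main() {
    std::vector<entry> entries = {{"hello", 0, 10, false}, {"bye", 2, 15, false}};
    std::map<std::string, std::size_t> map1;  // name   -> index into entries
    std::map<uint16_t, std::size_t> map2;     // offset -> index into entries

    // Each entry is stored once; both maps hold only an index into the vector.
    for (std::size_t i = 0; i < entries.size(); ++i) {
        map1[entries[i].name] = i;
        map2[entries[i].offset] = i;
    }

    const entry& byName = entries[map1.at("hello")];  // lookup by name
    const entry& byOffset = entries[map2.at(2)];      // lookup by offset
    (void)byName; (void)byOffset;
}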
Upvotes: 2
Reputation: 76297
You might consider using
std::map<std::string, std::shared_ptr<entry>>
for mapping strings to an entry, and
std::map<uint16_t, std::shared_ptr<entry>>
for mapping offsets to the same entry.
Note that by using shared pointers for the value payload (thus using the same entry object for both maps), you save the size of the payload. While you pay for two shared pointers, you'll still come out ahead for your specific structure. (The point is that there is only one entry object in memory.)
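A minimal sketch of that layout, with one shared entry referenced from both maps; the map names byName and byOffset are illustrative:

#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct entry {
    std::string name;
    uint16_t offset;
    uint16_t size;
    bool isConst;
};

int main() {
    std::map<std::string, std::shared_ptr<entry>> byName;
    std::map<uint16_t, std::shared_ptr<entry>> byOffset;

    // One heap-allocated entry, referenced from both maps.
    auto e = std::make_shared<entry>(entry{"hello", 0, 10, false});
    byName[e->name] = e;
    byOffset[e->offset] = e;

    std::cout << byName.at("hello")->size << "\n";                // 10
    std::cout << (byOffset.at(0) == byName.at("hello")) << "\n";  // 1 (same object)
}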
You might also be interested in boost::bimap.
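For the bimap route, a small sketch keyed by name and offset; note that a plain boost::bimap only relates the two keys to each other, so the rest of the entry (size, isConst) would still have to live elsewhere (for instance in a side container or via bimap's info hook):

#include <boost/bimap.hpp>
#include <cstdint>
#include <iostream>
#include <string>

int main() {
    using name_offset_bimap = boost::bimap<std::string, uint16_t>;
    name_offset_bimap m;

    m.insert(name_offset_bimap::value_type("hello", 0));
    m.insert(name_offset_bimap::value_type("bye", 2));

    // Left view: name -> offset; right view: offset -> name.
    auto nit = m.left.find("hello");
    if (nit != m.left.end())
        std::cout << nit->second << "\n";  // 0
    auto oit = m.right.find(2);
    if (oit != m.right.end())
        std::cout << oit->second << "\n";  // "bye"
}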
Upvotes: 6