Mad A.
Mad A.

Reputation: 431

Memory efficient map<pair<int,int>, set<int>> alternative

I have a huge amount (1500 Million) of Integer pairs where each one is associated with a document-ID. My goal now is to search for documents which have the same pair.

My first idea was to use a hash-map (std::map) using the pair values as keys and the document-IDs as associated values, i.e. map<pair<int,int>, unordered_set<int>>

For example:

Document1

 - pair1: (3, 9)
 - pair2: (5,13)

Document2

 - pair1: (4234, 13)
 - pair2: (5,13)

map<pair<int,int>, unordered_set<int>> hashMap
hashMap[{3, 9}].insert(1)
hashMap[{5, 13}].insert(1)

hashMap[{4234, 13}].insert(2)
hashMap[{5, 13}].insert(2)

would result into

Key(3,9) = Documents(1) 
Key(5,13) = Documents(1,2) 
Key(4234,13) = Documents(2)

My problem now is that this takes a huge amount of memory which exceeds my available 24 GB of RAM. Therefore I need an alternative with good performance for inserts and lookups which can fit into my memory. In theory I'm using 1500 Million * 3 (PairVal1, PairVal2, Document-ID) * 4 (bytes per Integer) = 18GB when overhead costs are not taking into account. So are there any good alternatives for my problem?

Upvotes: 2

Views: 822

Answers (3)

Zan Lynx
Zan Lynx

Reputation: 54325

This might be a job for an embedded database such as SQLite or BerkeleyDB or Tokyo Cabinet.

If the amount of data you're using exceeds your RAM then you really do need something that can work from disk.

Upvotes: 2

chema989
chema989

Reputation: 4172

One solution for reducing the space is instead of std::map<std::pair<int,int>, std::unordered_set<int>> use std::unordered_map<int, std::unordered_set<int>>

For convert std::pair<int, int> to int you must use a pairing function, for example:

Cantor’s Pairing Function

Obviously you're limited to using smaller numbers in your pairs.

The mapping for two maximum most 16 bit signed integers (32767, 32767) will be 2147418112 which is just short of maximum value for signed 32 bit integer.

Other option is create your own indexer based in a B-Tree, or using a Open Source Search Engine Library like xapian, it is written in C++ and is fast and easy to use.

Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications.

Upvotes: 0

John
John

Reputation: 7339

Can you use the file system?

Name directories after the first integer, create text files in each named for the second integer, each line of the text file can be a Document ID.

You're bound to suffer significant speed penalties on all of the I/O. Get as fast of a disk as you can. Storage requirements will grow significantly too, as directory names, file names, and file contents become ascii text instead of binary integers.

Upvotes: 0

Related Questions