Anna
Anna

Reputation: 13

How to store repeated data in an efficient way?

I am trying to find an efficient structure to store data in c++. Efficiency in both time and storage is important.

I have a set S = {0, 1, 2, 3, ..., N}, and multiple levels, say L levels. For each level l ∈ {0, 1, ..., L}, I need a structure, say M, to store 2 subsets of S, which will be the rows and columns of a matrix in the next steps. For example:

S = {0, 1, 2, 3, ..., 15} 
L = 2
L1_row = {0, 1, 2, 3, 4, 5}
L1_col = {3, 4, 5, 6, 9}
L2_row = {0, 2, 4, 5, 12, 14}
L2_col = {3, 6, 10, 11, 14, 15}

I have used an unordered_map with integer keys as level and a pair of unordered_sets for row and column, as follow:

unordered_map<int, pair<unordered_set<int>, unordered_set<int>>> M;

However, this is not efficient, for example, {3, 4, 5} recorded 3 times. As S is a large set, so M will contain many repeated numbers.

  1. In the next step, I will extract row and cols per level from M and create a matrix, so, fast access is important.
  2. M may or may not contain all items in S.
  3. M will fill in at 2 stages, first, rows for all levels, and then columns for all levels.

Upvotes: 0

Views: 264

Answers (1)

Hajo Kirchhoff
Hajo Kirchhoff

Reputation: 2075

That is a tough one. Memory and efficiency really depend on your concrete data set. If you don't want to store {3,4,5} 3 times, you'll have to create a "token" for it and use that instead.

There are patterns such as

  • flyweight or
  • run length encoding or
  • dictionary

for that. (boost-flyweight or dictionaries for ZIP/7Z and other compression algorithms).

However under some circumstances this can actually use more memory than just repeating the numbers.

Without further knowledge about the concrete use case it is very hard to suggest anything.

Example: Run length encoding is an efficient way to store {3,4,5,6,7...}. Basically you just store the first index and a length. So {3,4...12} becomes {{3, 10}}. That's easy to 'decode' and uses a lot less memory than the original sequence, but only if you have many consecutive sequences. If you have many short sequences it will be worse.

Another example: If you have lots of recurring patterns, say, {2,6,11} appears 23 times, then you could store this pattern as a flyweight or use a dictionary. You replace {2,6,11} with #4. The problem here is identifying patterns and optimizing your dictionary. That is one of the reasons why compressing files (7Zip, bzip etc...) takes longer than uncompressing.

Or you could even store the info in a bit pattern. If your matrix is 1024 columns wide, you could store the columns used in 128 Bytes with "bit set" == "use this column". Like a bitmask in image manipulation.

So I guess my answer really is: "it depends".

Upvotes: 2

Related Questions