Reputation: 7221

STL containers and large amounts of data

I have a large collection of data that is read into memory - temporarily, but necessary for the system.

I have been checking the performance of std::vector as well as std::unordered_map.
For std::vector I used a struct of type:

struct information{
    std::string name;
    unsigned int offset;
}

For std::unordered_map I used the std::string for the key and the unsigned int offset for the value.

If, let's say, 2 000 000 of these are loaded into memory, I tried the following and got these results:

std::vector:

On random string, never really larger than 32 characters if a reserve was called on the vector.

std::vector<information> vec;
vec.reserve(2500000);

The insertion

vec.push_back({dataName, offset});

is quite fast. Trying to find data is very slow though. The find was implemented like this:

auto it = std::find_if(vec.begin(), vec.end(), [&name](information &info) -> bool {return info.name == name); });

Which makes sense seeing that it is a large vector and the correct struct is found on a name compare. But it was extremely poor performance. The memory used was fine - I assume a part of the memory growth was due to the std::string size resizing.

My question on the vector implementation is: Is there a way to increase the look up time? I know that a vector can be sorted to increase your look-up time, but then you lose time in sorting the vector. Especially on a vector of this size.

std::unordered_map:

The insertion

std::unordered_map<std::string, unsigned int> unordMap;
unordMap.reserve(2500000);
unordMap.emplace(name, offset);

takes a very long time. When reserving space beforehand in an attempt to shorten the insertion time the following happens:

The memory at the end of insertion is a lot more when not calling reserve, without reserve the memory is still a lot more than the vector implementation. The reserve doesn't really improve insertion time.

Of course the look up is very fast. My question about the std::unordered_map is can the insertion time and memory usage be improved?

If neither of these can be done, then my next question will probably be quite obvious. Is there a way to get a result in between these two data structures? What is the best for large amounts of data?

Upvotes: 4

Answers (4)

Nelfeal

Reputation: 13269

The problem with a large vector is the lookup time when you don't know the index of objects you want. One way to improve it is, as you stated, to keep an ordered vector and do binary search on it. That way, lookup time will not be of linear complexity but rather of logarithmic complexity, which saves up quite a lot of time with very large containers. This is the lookup used in std::map (the ordered one). You can do a similar binary search using std::lower_bound or std::equal_range on your std::vector.

The problem with a large unordered map is completely different : this kind of container use a hash function and modulus calculation in order to place the elements according to their keys in a standard array. So when you have a n elements in a std::unordered_map, it is very unlikely that you only need an n-elements-long array, because some indices will not be filled. You will use at least the greatest index produced by the hash-and-modulo. One way to improve memory usage as well as insertion time is to write your own hash function. But it might be hard depending on what kind of strings you are using.

Upvotes: 1

Yakk - Adam Nevraumont

Reputation: 275405

struct information{
  std::string name;
  unsigned int offset;
  information(information const&)=default;
  information(information&&)=default;
  information(std::string n, unsigned o):name(std::move(n)),offset(o),hash(std::hash<std::string>()(name)) {};
  information():information("",0) {};
  bool operator<( information const& o ) const {
    return tie() < o.tie();
  }
  std::tuple<std::size_t, std::string const&> tie() const { return std::tie(hash, name); }
private:
  std::size_t hash;
};

Use the above structure for your std::vector.

After adding all the data, std::sort it.

To find something matching name do this:

struct information_searcher {
  struct helper {
    std::tuple<std::size_t, std::string const&> data;
    helper( std::string const& o ):data(std::hash<std::string>()(o), o) {};
    helper( helper const& o ) = default;
    helper( information const& o ):data(o.tie()) {}
    bool operator<( helper const& o ) const { return data < o.data; }
  };
  bool operator()( helper lhs, helper rhs ) const { return lhs < rhs; }
};

information* get_info_by_name( std::string const& name ) {
  auto range = std::equal_range( vec.begin(), vec.end(), information_searcher::helper(name), information_searcher{} );
  if (range.first == range.second) {
    return nullptr;
  } else {
    return &*range.first;
  }
}

which is a nearly zero-overhead lookup.

What we do here is hash the strings (for fast comparison), falling back on std::string comparison if we have a collision.

information_searcher is a class that lets us search the data without having to create an information (which would require a wasteful allocation).

get_info_by_name returns a pointer -- nullptr if not found, and a pointer to the first element with that name otherwise.

Changing information.name is impolite, and makes the hash field incorrect.

This may use moderately more memory than the naive std::vector version.

In general, if your work consists of 'add a bunch of stuff to a table' then 'do a bunch of lookups', your best bet is to build a std::vector, sort it in a fast way, then use equal_range to do the lookups. map and unordered_map are optimized for lots of mixed inserts/deletes/etc.

Upvotes: 4

firda

Reputation: 3338

vector is usually implemented as 'dynamic array' and should be the most memory-efficient. with good reservation strategy it can have insert of O(1) = fast. Searching is O(n) = very bad.

You can help vector by sorting it (and if you first load then search than I think it would be best - std::sort + std::binary_search).
You can as well implement something like insert-sort using std::lower_bound. insert = O(log n) = good, search = O(log n) = good

map (ordered) can actually do the same work, but may as well be implemented using tree = less memory efficient, access as good as sorted vector (but maybe less reallocation, but in your case, sorted vector is still best)

unordered_map is usually imlemented using hash tables = some memory overhead but fast operations (insertion cannot be that fast as in unsorted vector, but should still be pretty fast). The problen with hashing is that it can be fast and even fastest, but may as well be the worst solution (in extreme conditions). The above structures (sorted vector and map/tree are stable, always behave the same - logaritmic complexity).

Upvotes: 1

Garrappachc

Reputation: 749

Well, the optimal solution here would be to create std::map, which is logarithmic in complexity both in insertion and in lookup. Although, I don't see any reason why you wouldn't use std::vector. It is pretty fast when using quick sort to sort even 2M elements, especially if you do it once. std::binary_search is then really fast. Think it over if you need to make a lot of lookups between insertions.

Upvotes: 0

STL containers and large amounts of data

Answers (4)

Related Questions