Eric Sondergard

Reputation: 605

What is the most performant way to initialize a Dictionary from an IEnumerable?

I have a need to maintain a cache of database objects that are uniquely keyed (by an integer). A query delivers an instance of IEnumerable<MyEntity> (MyEntity uses an int primary key) with the results, and I'd like to initialize an instance of Dictionary<int, MyEntity> as fast as possible, because this query can return a few hundred thousand rows.

What is the most performant way to initialize an instance of Dictionary<int, MyEntity> from an IEnumerable<MyEntity>?

In short, I want to know if there is a more performant way to do this:

IEnumerable<MyEntity> entities = DoSomeQuery();

var cache = new Dictionary<int, MyEntity>();

foreach (var entity in entities)
    cache.Add(entity.Id, entity);

//or...

cache = entities.ToDictionary(e => e.Id);

Of course, the query has the biggest potential performance consequences, but it's important that I shave milliseconds wherever I can for my use case.

EDIT:

Worth noting here that Enumerable.ToDictionary literally runs a foreach loop internally, just like the first example, so one could assume the performance would be exactly the same, if not slightly worse (it invokes the key-selector delegate per element). Maybe that's my answer right there.

Upvotes: 1

Views: 1442

Answers (1)

Jon Hanna

Reputation: 113242

You're about as fast as you can get.

If you can quickly determine the number of elements you are going to add, then passing that count as the capacity to the Dictionary constructor will give a bit of a boost by preventing internal resize operations (the .NET Core version of ToDictionary() does that; the other versions do not).
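For example, a minimal sketch of that pre-sizing (MyEntity and DoSomeQuery here are just stand-ins for the question's types):

```csharp
using System;
using System.Collections.Generic;

IEnumerable<MyEntity> entities = DoSomeQuery();

// If the sequence is really a collection with a cheap Count,
// pre-size the dictionary so it never resizes while adding.
var cache = entities is ICollection<MyEntity> col
    ? new Dictionary<int, MyEntity>(col.Count)
    : new Dictionary<int, MyEntity>();

foreach (var entity in entities)
    cache.Add(entity.Id, entity);

Console.WriteLine(cache.Count);

// Illustrative stand-ins for the question's types.
static IEnumerable<MyEntity> DoSomeQuery() =>
    new List<MyEntity> { new(1), new(2), new(3) };

record MyEntity(int Id);
```

If the source is a lazy IEnumerable with no cheap count, the pattern falls back to the default constructor, so you lose nothing.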

If the keys are relatively tightly packed, then you can benefit from sizing to the range rather than the count. E.g. if you had Ids of {5, 6, 7, 9, 10, 11}, it would be beneficial to size to 7 (the number of values you would have if the missing 8 were present) rather than 6. (Actually, it would make no difference here, as the effect only kicks in with larger sets than this.) The effect is rather small though, so it's not worth doing if you're going to waste a lot of memory (e.g. it's definitely not worth storing {8, 307} in a 300-capacity dictionary!). The benefit comes from increasing how often a key hashes to a slot that doesn't clash with another element during the period when the internal size (and hence the internal hash-reduction) is smaller than it will be once you've finished adding them all.
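A quick sketch of that range-versus-count sizing, using the ids from the example above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int[] ids = { 5, 6, 7, 9, 10, 11 };

// Size to the span of the key range (7) rather than the count (6);
// for densely packed keys this reduces clashes while the table is
// still at its initial size.
int capacity = ids.Max() - ids.Min() + 1;

var dict = new Dictionary<int, string>(capacity);
foreach (int id in ids)
    dict.Add(id, $"entity {id}");

Console.WriteLine(dict.Count);
```

As noted, only bother when the range is close to the count; a sparse range wastes memory for no gain.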

If they are tightly packed but you can't predict the size, then there's a benefit to storing them in order, because as the internal storage grows, there will more often be a case where the dictionary wants to store something with an as-yet-unused reduced hash code. The benefit, though, will be smaller than the cost of sorting in memory (and that will require finding the number of elements anyway, either explicitly or within an OrderBy operation), so it's only helpful if there's a way of getting that ordering done for you cheaply. (E.g. some web services require that some sort of ordering criterion be given, so you may as well give the id as the criterion. Mostly this won't be applicable.)

These points, especially the last two, are tiny effects though, likely to not add up to anything measurable. Even the first is going to be smaller than the cost of obtaining the count if it isn't already in a source that has a cheap Count or Length operation.

The foreach itself can perhaps be improved by replacing it with indexing (when applicable), but sometimes that's worse. It also tends to do better on a concretely-typed source (i.e. foreach on a T[] array beats foreach on List<T>, which beats foreach on IEnumerable<T>), but that means exposing implementation details between layers and is rarely worth it, especially since many collection types gain nothing from this.
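For what it's worth, a sketch of the concrete-type variant. This assumes an extra ToArray() pass is acceptable, which may well cost more than the enumerator saving; measure before committing to it (MyEntity and DoSomeQuery are stand-ins for the question's types):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<MyEntity> entities = DoSomeQuery();

// Materialize once: foreach over a T[] uses the array's struct
// enumerator, and the now-known length lets us pre-size as well.
// The extra pass and allocation are the price of that.
MyEntity[] array = entities.ToArray();
var cache = new Dictionary<int, MyEntity>(array.Length);
foreach (var entity in array)
    cache.Add(entity.Id, entity);

Console.WriteLine(cache.Count);

// Illustrative stand-ins for the question's types.
static IEnumerable<MyEntity> DoSomeQuery() =>
    new List<MyEntity> { new(1), new(2) };

record MyEntity(int Id);
```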

Upvotes: 4

Related Questions