Reputation: 897

Generic List Contains() perfomance and alternatives

I need to store big amount of key, value pairs where key is not unique. Both key and value are strings. And items count is about 5 million.

My goal is to hold only unique pairs.

I've tried to use List<KeyValuePair<string, string>>, but the Contains() is extremely slow. LINQ Any() looks a little bit faster, but still too slow.

Are there any alternatives to perform the search faster on a generic list? Or maybe I should use another storage?

Upvotes: 2

Answers (6)

Julián Urbano

Reputation: 8488

I would use a Dictionary<string, HashSet<string>> mapping one key to all its values.

Here is a full solution. First, write a couple of extension methods to add a (key,value) pair to your Dictionary and another one to get all (key,value) pairs. Note that I use arbitrary types for keys and values, you can substitute this with string without problem. You can even write these methods somewhere else instead of as extensions, or not use methods at all and just use this code somewhere in your program.

public static class Program
{
  public static void Add<TKey, TValue>(
    this Dictionary<TKey, HashSet<TValue>> data, TKey key, TValue value)
  {
    HashSet<TValue> values = null;
    if (!data.TryGetValue(key, out values)) {
      // first time using this key? create a new HashSet 
      values = new HashSet<TValue>();
      data.Add(key, values);
    }
    values.Add(value);
  }
  public static IEnumerable<KeyValuePair<TKey, TValue>> KeyValuePairs<TKey, TValue>(
    this Dictionary<TKey, HashSet<TValue>> data)
  {
    return data.SelectMany(k => k.Value,
                           (k, v) => new KeyValuePair<TKey, TValue>(k.Key, v));
  }
}

Now you can use it as follows:

public static void Main(string[] args)
{
  Dictionary<string, HashSet<string>> data = new Dictionary<string, HashSet<string>>();
  data.Add("k1", "v1.1");
  data.Add("k1", "v1.2");
  data.Add("k1", "v1.1"); // already in, so nothing happens here
  data.Add("k2", "v2.1");

  foreach (var kv in data.KeyValuePairs())
     Console.WriteLine(kv.Key + " : " + kv.Value);
}

Which will print this:

k1 : v1.1
k1 : v1.2
k2 : v2.1

If your key mapped to a List<string> then you would need to take care of duplicates yourself. HashSet<string> does that for you already.

Upvotes: 5

qxn

Reputation: 17574

You will most likely see an improvement if you use a HashSet<KeyValuePair<string, string>>.

The test below finishes on my machine in about 10 seconds. If I change...

var collection = new HashSet<KeyValuePair<string, string>>();

...to...

var collection = new List<KeyValuePair<string, string>>();

...I get tired of waiting for it to complete (more than a few minutes).

Using a KeyValuePair<string, string> has the advantage that equality is determined by the values of Key and Value. Since strings are interned, and KeyValuePair<TKey, TValue> is a struct, pairs with the same Key and Value will be considered equal by the runtime.

You can see that equality with this test:

    var hs = new HashSet<KeyValuePair<string, string>>();
    hs.Add(new KeyValuePair<string, string>("key", "value"));
    var b = hs.Contains(new KeyValuePair<string, string>("key", "value"));
    Console.WriteLine(b);

One thing that's important to remember, though, is that the equality of pairs depends on the internment of strings. If, for some reason, your strings aren't interned (because they come from a file or something), the equality probably won't work.

using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace ConsoleApplication1 {

    internal class Program {

        static void Main(string[] args) {

            var key = default(string);
            var value = default(string);

            var collection = new HashSet<KeyValuePair<string, string>>();

            for (var i = 0; i < 5000000; i++) {

                if (key == null || i % 2 == 0) {
                    key = "k" + i;
                }
                value = "v" + i;

                collection.Add(new KeyValuePair<string, string>(key, value));
            }

            var found = 0;

            var sw = new Stopwatch();
            sw.Start();
            for (var i = 0; i < 5000000; i++) {

                if (collection.Contains(new KeyValuePair<string, string>("k" + i, "v" + i))) {
                    found++;
                }
            }
            sw.Stop();

            Console.WriteLine("Found " + found);
            Console.WriteLine(sw.Elapsed);
            Console.ReadLine();
        }
    }
}

Upvotes: 1

Scott Chamberlain

Reputation: 127543

To make a unique list you want to use .Distinct() to generate it, not .Contains(). However whatever class holds your strings must implement .GetHashCode() and .Equals() correctly to get good performance or you must pass in a custom comparer.

Here is how you could do it with a custom comparer

    private static void Main(string[] args)
    {

        List<KeyValuePair<string, string>> giantList = Populate();
        var uniqueItems = giantList.Distinct(new MyStringEquater()).ToList();
    }

    class MyStringEquater : IEqualityComparer<KeyValuePair<string, string>>
    {
        //Choose which comparer you want based on if you want your comparisions to be case sensitive or not
        private static StringComparer comparer = StringComparer.OrdinalIgnoreCase; 

        public bool Equals(KeyValuePair<string, string> x, KeyValuePair<string, string> y)
        {
            return comparer.Equals(x.Key, y.Key) && comparer.Equals(x.Value, y.Value);
        }

        public int GetHashCode(KeyValuePair<string, string> obj)
        {
            unchecked
            {
                int x = 27;
                x = x*11 + comparer.GetHashCode(obj.Key);
                x = x*11 + comparer.GetHashCode(obj.Value);
                return x;
            }
        }
    }

Also per your comment in the other answer you could also use the above comparer in a HashSet and have it store your unique items that way. You just need to pass in the comparer in to the constructor.

var hashSetWithComparer = new HashSet<KeyValuePair<string,string>(new MyStringEquater());

Upvotes: 1

Konrad Kokosa

Reputation: 16878

I would consider using some in-proc NoSQL database like RavenDB (RavenDB Embedded in this case) as they state on their website:

RavenDB can be used for application that needs to store millions of records and has fast query times.

Using it is requires no big boilerplate (example from RavenDB website):

var myCompany = new Company
                {
                    Name = "Hibernating Rhinos",
                    Employees = {
                                   new Employee
                                   {
                                       Name = "Ayende Rahien"
                                   }
                                 },
                    Country = "Israel"
                };

// Store the company in our RavenDB server
using (var session = documentStore.OpenSession())
{
    session.Store(myCompany);
    session.SaveChanges();
}

// Create a new session, retrieve an entity, and change it a bit
using (var session = documentStore.OpenSession())
{
    Company entity = session.Query<Company>()
        .Where(x => x.Country == "Israel")
        .FirstOrDefault();

    // We can also load by ID: session.Load<Company>(companyId);
    entity.Name = "Another Company";
    session.SaveChanges(); // will send the change to the database
}

Upvotes: 1

Klark

Reputation: 8280

I guess that Dictionary<string, List<string>> will do the trick.

Upvotes: 1

DoctorMick

Reputation: 6793

Have you tried using a Hashset? Much quicker than lists when large numbers are involved although I don't know if it'd still be too slow.

This answer has a lot of information: HashSet vs. List performance

Upvotes: 0

Generic List Contains() perfomance and alternatives

Answers (6)

Related Questions