Reputation: 897
I need to store big amount of key, value pairs where key is not unique. Both key and value are strings. And items count is about 5 million.
My goal is to hold only unique pairs.
I've tried to use List<KeyValuePair<string, string>>
, but the Contains()
is extremely slow.
LINQ Any()
looks a little bit faster, but still too slow.
Are there any alternatives to perform the search faster on a generic list? Or maybe I should use another storage?
Upvotes: 2
Views: 3866
Reputation: 8488
I would use a Dictionary<string, HashSet<string>>
mapping one key to all its values.
Here is a full solution. First, write a couple of extension methods to add a (key,value)
pair to your Dictionary
and another one to get all (key,value)
pairs. Note that I use arbitrary types for keys and values, you can substitute this with string
without problem.
You can even write these methods somewhere else instead of as extensions, or not use methods at all and just use this code somewhere in your program.
public static class Program
{
public static void Add<TKey, TValue>(
this Dictionary<TKey, HashSet<TValue>> data, TKey key, TValue value)
{
HashSet<TValue> values = null;
if (!data.TryGetValue(key, out values)) {
// first time using this key? create a new HashSet
values = new HashSet<TValue>();
data.Add(key, values);
}
values.Add(value);
}
public static IEnumerable<KeyValuePair<TKey, TValue>> KeyValuePairs<TKey, TValue>(
this Dictionary<TKey, HashSet<TValue>> data)
{
return data.SelectMany(k => k.Value,
(k, v) => new KeyValuePair<TKey, TValue>(k.Key, v));
}
}
Now you can use it as follows:
public static void Main(string[] args)
{
Dictionary<string, HashSet<string>> data = new Dictionary<string, HashSet<string>>();
data.Add("k1", "v1.1");
data.Add("k1", "v1.2");
data.Add("k1", "v1.1"); // already in, so nothing happens here
data.Add("k2", "v2.1");
foreach (var kv in data.KeyValuePairs())
Console.WriteLine(kv.Key + " : " + kv.Value);
}
Which will print this:
k1 : v1.1
k1 : v1.2
k2 : v2.1
If your key mapped to a List<string>
then you would need to take care of duplicates yourself. HashSet<string>
does that for you already.
Upvotes: 5
Reputation: 17574
You will most likely see an improvement if you use a HashSet<KeyValuePair<string, string>>
.
The test below finishes on my machine in about 10 seconds. If I change...
var collection = new HashSet<KeyValuePair<string, string>>();
...to...
var collection = new List<KeyValuePair<string, string>>();
...I get tired of waiting for it to complete (more than a few minutes).
Using a KeyValuePair<string, string>
has the advantage that equality is determined by the values of Key
and Value
. Since strings are interned, and KeyValuePair<TKey, TValue>
is a struct, pairs with the same Key
and Value
will be considered equal by the runtime.
You can see that equality with this test:
var hs = new HashSet<KeyValuePair<string, string>>();
hs.Add(new KeyValuePair<string, string>("key", "value"));
var b = hs.Contains(new KeyValuePair<string, string>("key", "value"));
Console.WriteLine(b);
One thing that's important to remember, though, is that the equality of pairs depends on the internment of strings. If, for some reason, your strings aren't interned (because they come from a file or something), the equality probably won't work.
using System;
using System.Collections.Generic;
using System.Diagnostics;
namespace ConsoleApplication1 {
internal class Program {
static void Main(string[] args) {
var key = default(string);
var value = default(string);
var collection = new HashSet<KeyValuePair<string, string>>();
for (var i = 0; i < 5000000; i++) {
if (key == null || i % 2 == 0) {
key = "k" + i;
}
value = "v" + i;
collection.Add(new KeyValuePair<string, string>(key, value));
}
var found = 0;
var sw = new Stopwatch();
sw.Start();
for (var i = 0; i < 5000000; i++) {
if (collection.Contains(new KeyValuePair<string, string>("k" + i, "v" + i))) {
found++;
}
}
sw.Stop();
Console.WriteLine("Found " + found);
Console.WriteLine(sw.Elapsed);
Console.ReadLine();
}
}
}
Upvotes: 1
Reputation: 127543
To make a unique list you want to use .Distinct()
to generate it, not .Contains()
. However whatever class holds your strings must implement .GetHashCode()
and .Equals()
correctly to get good performance or you must pass in a custom comparer.
Here is how you could do it with a custom comparer
private static void Main(string[] args)
{
List<KeyValuePair<string, string>> giantList = Populate();
var uniqueItems = giantList.Distinct(new MyStringEquater()).ToList();
}
class MyStringEquater : IEqualityComparer<KeyValuePair<string, string>>
{
//Choose which comparer you want based on if you want your comparisions to be case sensitive or not
private static StringComparer comparer = StringComparer.OrdinalIgnoreCase;
public bool Equals(KeyValuePair<string, string> x, KeyValuePair<string, string> y)
{
return comparer.Equals(x.Key, y.Key) && comparer.Equals(x.Value, y.Value);
}
public int GetHashCode(KeyValuePair<string, string> obj)
{
unchecked
{
int x = 27;
x = x*11 + comparer.GetHashCode(obj.Key);
x = x*11 + comparer.GetHashCode(obj.Value);
return x;
}
}
}
Also per your comment in the other answer you could also use the above comparer in a HashSet and have it store your unique items that way. You just need to pass in the comparer in to the constructor.
var hashSetWithComparer = new HashSet<KeyValuePair<string,string>(new MyStringEquater());
Upvotes: 1
Reputation: 16878
I would consider using some in-proc NoSQL database like RavenDB (RavenDB Embedded in this case) as they state on their website:
RavenDB can be used for application that needs to store millions of records and has fast query times.
Using it is requires no big boilerplate (example from RavenDB website):
var myCompany = new Company
{
Name = "Hibernating Rhinos",
Employees = {
new Employee
{
Name = "Ayende Rahien"
}
},
Country = "Israel"
};
// Store the company in our RavenDB server
using (var session = documentStore.OpenSession())
{
session.Store(myCompany);
session.SaveChanges();
}
// Create a new session, retrieve an entity, and change it a bit
using (var session = documentStore.OpenSession())
{
Company entity = session.Query<Company>()
.Where(x => x.Country == "Israel")
.FirstOrDefault();
// We can also load by ID: session.Load<Company>(companyId);
entity.Name = "Another Company";
session.SaveChanges(); // will send the change to the database
}
Upvotes: 1
Reputation: 6793
Have you tried using a Hashset? Much quicker than lists when large numbers are involved although I don't know if it'd still be too slow.
This answer has a lot of information: HashSet vs. List performance
Upvotes: 0