Reputation: 55
I have been tasked with matching 1.7 million records against some results which have been passed to me in a CSV file.
A little bit of background to the code below: I have two lists...
Certs, which contains 5 properties, with ID being the equivalent of a PK.
Orders, which contains a list of IDs which should be contained in the Certs list.
I need to match the two and do something with those Cert objects which are found.
foreach (Classes.CertOrder.IDS OrderUnitID in Order.AllIDs)
{
    var Cert = (from C in Certs
                where C.ID.ToUpper() == OrderUnitID.ID.ToUpper()
                select C).FirstOrDefault();
    if (Cert != null)
    {
        Output.Add(Cert);
        OrderUnitID.fulfilled = true;
    }
}
This code works, but it's super slow (to be expected, I guess, given the number of records). Is there any way I can speed this up?
Edit to add: I would love to be able to load the data into a SQL Server to run the queries, but the data is such that it is not allowed to leave the workstation on which the file is being processed, or even allowed to touch the disk in an unencrypted form.
In combination with the helpful answer below, I have changed my output to be list based and pre-sorted both lists by ID, and now the processing takes seconds rather than hours! Thanks, Stack Overflow!
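The "pre-sort both lists, then walk them together" approach the OP describes can be sketched as a merge join. This is a minimal, self-contained sketch under assumptions: the `Cert` type and `Match` helper below are illustrative stand-ins for the OP's classes (which aren't shown in full), and IDs are assumed unique on each side, as the question's "equivalent of a PK" suggests.

```csharp
using System;
using System.Collections.Generic;

// Illustrative stand-in for the OP's Cert class.
public class Cert { public string ID = ""; }

public static class MergeJoinSketch
{
    // Sort both lists once (O(n log n)), then walk them together in a
    // single O(n + m) pass instead of scanning Certs per order ID.
    public static List<Cert> Match(List<Cert> certs, List<string> orderIds)
    {
        var cmp = StringComparer.OrdinalIgnoreCase;
        certs.Sort((a, b) => cmp.Compare(a.ID, b.ID));
        orderIds.Sort(cmp);

        var output = new List<Cert>();
        int i = 0, j = 0;
        while (i < certs.Count && j < orderIds.Count)
        {
            int c = cmp.Compare(certs[i].ID, orderIds[j]);
            if (c < 0) i++;           // cert ID not requested; skip it
            else if (c > 0) j++;      // order ID has no matching cert
            else { output.Add(certs[i]); i++; j++; }  // match found
        }
        return output;
    }

    public static void Main()
    {
        var certs = new List<Cert> { new Cert { ID = "b1" }, new Cert { ID = "A7" }, new Cert { ID = "c3" } };
        var ids = new List<string> { "a7", "C3", "zz" };
        Console.WriteLine(MergeJoinSketch.Match(certs, ids).Count); // prints 2
    }
}
```

Like the dictionary approach in the accepted answer, this avoids the O(n·m) repeated scans of the original query; the merge join trades the dictionary's hashing cost for two sorts.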
Upvotes: 0
Views: 119
Reputation: 46
Extending the accepted answer a bit with a few other options: OrdinalIgnoreCase gives you the best single-threaded performance, while parallelizing gives the best overall performance.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class Item { public string Id { get; set; } }

class Program
{
    private static Random rng = new Random();
    private static string characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    static void Main(string[] args)
    {
        var list = Enumerable.Range(1, 2_700_000)
            .Select(x => string.Join("", Enumerable.Range(5, rng.Next(20)).Select(y => characters[rng.Next(0, characters.Length)])))
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .Select(x => new { Order = rng.Next(), Item = new Item { Id = x } })
            .OrderBy(x => x.Order)
            .Select(x => x.Item)
            .ToList();
        Console.WriteLine("Master List Size: {0}", list.Count);

        var matches = list.Take(350_000).Select(x => x.Id).ToList();
        Console.WriteLine("Matches List Size: {0}", matches.Count);

        var dict = list.ToDictionary(x => x.Id, x => x, StringComparer.CurrentCultureIgnoreCase);
        var results = new List<Item>();
        var sw = new Stopwatch();

        Console.WriteLine("CurrentCultureIgnoreCase Elapsed Time (avg): {0}",
            Enumerable.Range(1, 10).Select(x =>
            {
                sw.Start();
                foreach (var m in matches)
                    if (dict.TryGetValue(m, out var item))
                        results.Add(item);
                sw.Stop();
                var t = sw.ElapsedMilliseconds;
                sw.Reset();
                return t;
            }).Average());

        dict = list.ToDictionary(x => x.Id.ToUpper(), x => x);
        Console.WriteLine("ToUpper() Elapsed Time (avg): {0}",
            Enumerable.Range(1, 10).Select(x =>
            {
                sw.Start();
                foreach (var m in matches)
                    if (dict.TryGetValue(m.ToUpper(), out var item))
                        results.Add(item);
                sw.Stop();
                var t = sw.ElapsedMilliseconds;
                sw.Reset();
                return t;
            }).Average());

        dict = list.ToDictionary(x => x.Id, x => x, StringComparer.OrdinalIgnoreCase);
        Console.WriteLine("OrdinalIgnoreCase Elapsed Time (avg): {0}",
            Enumerable.Range(1, 10).Select(x =>
            {
                sw.Start();
                foreach (var m in matches)
                    if (dict.TryGetValue(m, out var item))
                        results.Add(item);
                sw.Stop();
                var t = sw.ElapsedMilliseconds;
                sw.Reset();
                return t;
            }).Average());
        var cDict = new ConcurrentDictionary<string, Item>(dict);
        var cResults = new ConcurrentBag<Item>();
        Console.WriteLine("Parallel Elapsed Time (avg): {0}",
            Enumerable.Range(1, 10).Select(x =>
            {
                sw.Start();
                Parallel.ForEach(matches, new ParallelOptions { MaxDegreeOfParallelism = 20 }, m =>
                {
                    if (cDict.TryGetValue(m, out var item))
                        cResults.Add(item);
                });
                sw.Stop();
                var t = sw.ElapsedMilliseconds;
                sw.Reset();
                return t;
            }).Average());
    }
}
Results
Master List Size: 2158882
Matches List Size: 350000
CurrentCultureIgnoreCase Elapsed Time (avg): 298.2
ToUpper() Elapsed Time (avg): 179.6
OrdinalIgnoreCase Elapsed Time (avg): 163.9
Parallel Elapsed Time (avg): 74.6
Upvotes: 0
Reputation: 37770
Build a dictionary from Certs:
var certsMapping = Certs
    .ToDictionary(_ => _.ID.ToUpper());

foreach (Classes.CertOrder.IDS OrderUnitID in Order.AllIDs)
{
    if (certsMapping.TryGetValue(OrderUnitID.ID.ToUpper(), out var cert))
    {
        Output.Add(cert);
        OrderUnitID.fulfilled = true;
    }
}
Upvotes: 1
Reputation: 4014
Why are database lookups faster? One of the reasons is indexes.
You can use i4o to create an index for in-memory lists.
And then use a parallel for-each loop to speed things up.
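The two ideas above (an in-memory index plus parallel lookups) can be sketched without i4o itself: a plain `Dictionary` already serves as a hash index, and `Parallel.ForEach` spreads the probes across cores. The `Match` helper and names below are illustrative, not i4o's API.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelLookupSketch
{
    // Probe a prebuilt hash index in parallel. Concurrent reads of a
    // Dictionary are safe as long as nothing writes to it; results go
    // into a ConcurrentBag because many threads add at once.
    public static List<string> Match(Dictionary<string, string> index, IEnumerable<string> keys)
    {
        var hits = new ConcurrentBag<string>();
        Parallel.ForEach(keys, k =>
        {
            if (index.TryGetValue(k, out var v))
                hits.Add(v);
        });
        return hits.ToList();
    }

    public static void Main()
    {
        var index = Enumerable.Range(0, 1000).ToDictionary(i => "id" + i, i => "cert" + i);
        var found = Match(index, new[] { "id1", "id999", "missing" });
        Console.WriteLine(found.Count); // prints 2
    }
}
```

Note that the parallel win only shows up once the per-lookup cost is already small; building the index is the bigger step, as the accepted answer's timings illustrate.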
Upvotes: 0