Reputation: 75
I am working on a project (in .NET 3.5) that reads in 2 files, then compares them and finds the missing objects.
Based on this data, I need to parse it further and determine each missing object's location. I'll try to explain this further:
I have 2 lists. List 1 is a very long list of all files on a server, along with their physical address on that server (or another server). This file is a little over 1 billion lines long and continuously growing (a little ridiculous, I know); its size is currently around 160 MB. The other list is a report that shows missing files on the server. This list is minuscule compared to list 1 and is usually under 1 MB in size.
I have to intersect list 2 with list 1 and determine where the missing objects are located. The items in the lists look like this (unfortunately it is space-separated and not a CSV document): filename.extension rev rev# source server:harddriveLocation\|filenameOnServer.extension origin
Using a stream, I read both files into separate string lists. I then take a regex and parse items from list 2 into a third list that contains the filename.extension, rev and rev#. All of this works fantastically; it's the performance that is killing me.
I am hoping there is a much more efficient way to do what I am doing.
int i = 0; //line index into slMissingObjectReport (declared here so the snippet compiles)
foreach (String item in slMissingObjectReport)
{
    if (item.Contains(".ext1") || item.Contains(".ext2") || item.Contains(".ext3"))
    {
        if (!item.Contains("|"))
        {
            slMissingObjects.Add(item + "," + slMissingObjectReport[i + 1] + "," + slMissingObjectReport[i + 2]); //object, rev, version
        }
    }
    i++;
}
Stopwatch taskTime = Stopwatch.StartNew(); //times the whole run (declared here so the snippet compiles)
int j = 1; //debug only
foreach (String item in slMissingObjects)
{
    IEnumerable<String> found = Enumerable.Empty<String>();
    Stopwatch matchTime = new Stopwatch(); //used for debugging
    matchTime.Start(); //start the stopwatch
    foreach (String items in slAllObjects.Where(s => s.Contains(item.Remove(item.IndexOf(',')))))
    {
        slFoundInAllObjects.Add(item);
    }
    matchTime.Stop();
    tsStatus.Text = "Missing Object Count: " + slMissingObjects.Count + " | " + "All Objects count: " + slAllObjects.Count + " | Time elapsed: " + (taskTime.ElapsedMilliseconds) * 0.001 + "s | Items left: " + (slMissingObjects.Count - j).ToString();
    j++;
}
taskTime.Stop();
lstStatus.Items.Add(("Time to complete all tasks: " + (taskTime.ElapsedMilliseconds) * 0.001) + "s");
This works, but since there are currently 1300 missing items in my missing-objects list, it takes an average of 8 to 12 minutes to complete. The part that takes the longest is:
foreach (String items in slAllObjects.Where(s => s.Contains(item.Remove(item.IndexOf(',')))))
{
    slFoundInAllObjects.Add(item);
}
I just need a pointer in the right direction, along with maybe a hand on how I can improve this code. It seems the LINQ isn't the killer; it's adding the results to a list that seems to kill the performance.
Upvotes: 3
Views: 11908
Reputation: 7592
One improvement you can make would be to use AddRange instead of Add. AddRange will allow the internal list to preallocate the memory it needs for the add, instead of growing multiple times over the course of your foreach loop.
IEnumerable<string> items = slAllObjects.Where(s => s.Contains(item.Remove(item.IndexOf(','))));
slFoundInAllObjects.AddRange(items);
Secondly, you should probably avoid calling item.Remove(item.IndexOf(',')) inside your Where lambda, as this would cause it to be executed once for every item in the list. That value doesn't change, so you can compute it once ahead of time.
var itemWithoutComma = item.Remove(item.IndexOf(','));
IEnumerable<string> items = slAllObjects.Where(s => s.Contains(itemWithoutComma));
slFoundInAllObjects.AddRange(items);
Upvotes: 2
Reputation: 776
In addition to what has already been suggested, I would consider using a tree. If I understood correctly, there is some sort of hierarchy (i.e. server, file path, file name, etc.) in the file names, right? By using a tree you greatly reduce the search space at each step.
Also, if you use a Dictionary<String, Node> in each node, you can reduce the search time, which becomes O(1) for a constant number of hierarchy levels.
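For illustration, here is a minimal sketch of such a tree, assuming each line can be split into hierarchy segments (server, folder, file); the Node class and the segment splitting are hypothetical, not something from the original post:

using System.Collections.Generic;

// Hypothetical tree node: one level per hierarchy segment (server, folder, file).
class Node
{
    public Dictionary<string, Node> Children = new Dictionary<string, Node>();
    public string Line; // original line from the all-objects list, set on leaf nodes

    // Insert a path that has been split into segments, e.g. { "server", "folder", "file.ext" }.
    public void Insert(string[] segments, int depth, string line)
    {
        if (depth == segments.Length)
        {
            Line = line;
            return;
        }
        Node child;
        if (!Children.TryGetValue(segments[depth], out child))
        {
            child = new Node();
            Children.Add(segments[depth], child);
        }
        child.Insert(segments, depth + 1, line);
    }

    // Walk the tree with one dictionary lookup per level; returns null if any segment is missing.
    public Node Find(string[] segments, int depth)
    {
        if (depth == segments.Length)
        {
            return this;
        }
        Node child;
        return Children.TryGetValue(segments[depth], out child) ? child.Find(segments, depth + 1) : null;
    }
}

Each lookup then costs one dictionary access per hierarchy level, which is effectively constant time for a fixed number of levels.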
Also, if you decide to use arrays or array lists, avoid foreach and use for, as it should be faster (no iterator is involved, so for array lists at least it tends to be quicker).
Let me know if anything is unclear.
Upvotes: 0
Reputation: 367
There seem to be a few bottlenecks, which have already been pointed out.
If I understand correctly, you are:
1. reading both files into string lists,
2. looping over every item in the missing-objects list (n items), and
3. for each of those items, scanning the entire all-objects list (m lines) with a substring Contains.
So you have something on the order of O(K + n * m), where K is the cost of reading the files.
The bottleneck happens in steps 2 and 3 (the inner loop in your code).
Solution: build a lookup over the all-objects list once, either a hash set keyed on the filename or a sorted list you can binary-search, and then look each missing item up in it.
This should reduce the O(n * m) inner work to roughly O(n) if you use a hash set, or O(n * log(m)) if you sort the list.
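A rough sketch of the hash-set variant (assumptions: the key for a missing item is everything before the first comma, and ExtractKey is a hypothetical helper that pulls the matching filename out of a line of the all-objects list; requires System.Collections.Generic):

// Build the lookup once: one pass over the all-objects list.
HashSet<string> allObjectKeys = new HashSet<string>();
foreach (string line in slAllObjects)
{
    allObjectKeys.Add(ExtractKey(line)); // ExtractKey is hypothetical; adapt it to the space-separated format
}

// Each missing item is now a single hash lookup instead of a full scan.
foreach (string item in slMissingObjects)
{
    string key = item.Remove(item.IndexOf(','));
    if (allObjectKeys.Contains(key))
    {
        slFoundInAllObjects.Add(item);
    }
}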
Upvotes: 2
Reputation: 21492
First off, don't use a List. Use a HashSet for quicker inserts and comparisons.
Next up, determine whether the lists are already sorted. If they are, you can read both files at the same time, do a single pass through each, and never have to keep them in memory at all.
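If the inputs really are sorted, a single-pass merge might look something like this sketch (the file names and the GetKey helper are hypothetical, and both files must be sorted by the same key for this to work; requires System.IO):

using (StreamReader all = new StreamReader("allObjects.txt"))         // assumed file name
using (StreamReader missing = new StreamReader("missingReport.txt"))  // assumed file name
{
    string allLine = all.ReadLine();
    string missingLine = missing.ReadLine();

    while (allLine != null && missingLine != null)
    {
        // GetKey is a hypothetical helper that extracts the sort key from a line.
        int cmp = String.CompareOrdinal(GetKey(missingLine), GetKey(allLine));
        if (cmp == 0)
        {
            slFoundInAllObjects.Add(allLine);   // missing object located in the big list
            missingLine = missing.ReadLine();
        }
        else if (cmp > 0)
        {
            allLine = all.ReadLine();           // advance the big list
        }
        else
        {
            missingLine = missing.ReadLine();   // no match for this missing item
        }
    }
}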
If all else fails, look into using LINQ's Intersect method, which will likely perform much better than your home-grown version of it.
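As a sketch of that (assuming both lists have been reduced to comparable keys; the ExtractKey helper is a guess, not the poster's format; requires System.Linq):

// Intersect builds a set internally, so each sequence is enumerated only once.
IEnumerable<string> missingKeys = slMissingObjects.Select(s => s.Remove(s.IndexOf(',')));
IEnumerable<string> allKeys = slAllObjects.Select(line => ExtractKey(line)); // hypothetical key extraction
List<string> foundKeys = missingKeys.Intersect(allKeys).ToList();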
Upvotes: 0
Reputation: 38810
HashSets are designed specifically for this kind of task, where you have unique values and you need to compare them.
Lists are not; they are just arbitrary collections.
My first port of call here would be to use a HashSet<> and the various intersection methods that come free with it.
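A minimal sketch of that approach, assuming both collections are first reduced to the same key (the ExtractKey helper is hypothetical, since the original post doesn't show how a line maps to a filename; requires System.Linq and System.Collections.Generic):

// HashSet<T> ships with .NET 3.5 (System.Core).
HashSet<string> allKeys = new HashSet<string>(slAllObjects.Select(line => ExtractKey(line)));            // hypothetical helper
HashSet<string> missingKeys = new HashSet<string>(slMissingObjects.Select(s => s.Remove(s.IndexOf(','))));

missingKeys.IntersectWith(allKeys); // missingKeys now holds only the items that were found on the server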
Upvotes: 6