Reputation: 540
Note: Solved this using LINQ join.
I need to compare a list value from Source List if it is present in the destination list, and if yes then save it to a third list.
The code which I have written does work but it is taking a lot of time since my Source List has 30k items and it compares every item value to the destination list of 15 million which takes a lot of time. cause it will iterate through the whole list every time (30k *15 Million times)
See the code which is apparently not optimal but does the job.
// The below code will generate the lists from CSV file
The lists are below for sample
**Source List**
FileId FilePath FileChecksum
1 somepath A check1
2 somepath AA check2
3 somepath AAB check3
4 somepath B check4
5 somepath BB check5
**Destination List**
StepId StatusID JobId ProjectId FileId FilePath
5 6 4 2091 577206853 somepath A
5 6 4 2092 577206853 somepath AA
5 6 4 2093 577206853 somepath AAA
5 6 4 2094 577206853 somepath AB
5 6 4 2095 577206853 somepath A
5 6 4 2096 577206853 somepath B
5 6 4 2097 577206853 somepath BB
List<Source> SourceList = File.ReadAllLines(@"D:\source.csv").Skip(1).Select(v => Source.SourceFromCSv(v)).ToList();
List<Destination> DestinationList = File.ReadAllLines(@"D:\Destination.csv").Skip(1).Select(d => Destination.FromDestinationCSV(d)).ToList();
//This will compare and create a new list
var result1 =
from s in SourceList
from d in DestinationList
where (d.FilePath.ToLower() == s.FilePath.ToLower())
select (d.StepId + "," + d.StatusId + "," + d.JobId + "," +
d.ProjectId + "," + d.FileId + "," + d.FilePath + "," +
s.FileChecksum);
Expected Result:
StepId StatusID JobId ProjectId FileId FilePath FileChecksum
5 6 4 2091 577206853 somepath A check1
5 6 4 2092 577206853 somepath AA check2
5 6 4 2095 577206853 somepath A check1
5 6 4 2096 577206853 somepath B check4
5 6 4 2097 577206853 somepath BB check5
Upvotes: 0
Views: 272
Reputation: 540
var query = from s in SourceList
join d in DestinationList on
s.FilePath.ToLower().TrimEnd() equals d.FilePath.ToLower().TrimEnd()
select (d.StepId + "," + d.StatusId + "," + d.JobId + "," +d.ProjectId + "," + d.FileId + "," + d.FilePath + "," + s.FileChecksum);
LINQ join did the same thing in less than 5 seconds.
Upvotes: 1
Reputation: 27
You can order both lists and then compare row by row. The algorithmic Complexity is O(n log n+n).
You compare first line of data A to first line of data B and then increase the index of the pointer on the "bigger" line. If data A has 8 and data B has 7 and 9, you will know the 8 is not present in data B once you reached 9.
You should start comparing at the max possible index. This way, you get quick termination, if the list is indeed a sublist.
Upvotes: 3
Reputation: 114661
Yeah, if you don't need all the features of it being a list, making the base type a HashSet<T>
will significantly improve lookups. Your custom type may need to implement a proper GetHashCode()
function to improve lookup speeds even further.
See:
Don't call new HashSet(query.ToList())
, instead, convert directly to a hashset while instantiating the list, query.ToHashSet()
, optionally passing in an Equality Comparer, see below:
Instead of a custom GetHashCode
implementation you can also implement a custom IEqualityComparer
to handle specific cases such as yours where specific fields make up the rules for equality. Visual Studio and Resharper nowadays offer a built-in refactor to generate a good implementation of GetHashCode
and Equals
.
See:
You can then use IntersectWith
to get all items in both sets in a single call:
See:
Creating a special object you can convert both Source
and Destination
to, or giving them the same base class will allow this.
You can also use a IDictionary<Key, Value>
and make the key the Item.FilePath.ToLower()
, the same principles as above apply. This will allow the runtime to check whether the item exists in the other list using the GetHashCode
of the string, which is highly optimized by default.
Upvotes: 1
Reputation: 2249
All you're doing in principle is appending the file checksum onto the end of the destination list.
Create a hash or dictionary out of the source list, then your new list looks something like this.
//create dictionary SourceDictionary<string,string> with key = filepath.tolower and value = checksum
var newList = DestinationList.select(d => $"{d.thing1},{d.thingN}" + SourceDictionary[d.filename.tolower()])
should be much faster
Upvotes: 0
Reputation: 11
You could do it the other way around. Instead of picking one of the 30k source entries, you could iterate over your 30 million entries. you can then stop if you found all 30k entries or in worst case, after 30 million entries. thats still better than 30K*15M.
Upvotes: 1