Reputation: 540

How do I compare and update 2 lists with huge number of items?

Note: Solved this using LINQ join.

I need to compare a list value from Source List if it is present in the destination list, and if yes then save it to a third list.

The code which I have written does work but it is taking a lot of time since my Source List has 30k items and it compares every item value to the destination list of 15 million which takes a lot of time. cause it will iterate through the whole list every time (30k *15 Million times)

See the code which is apparently not optimal but does the job.

        // The below code will generate the lists from CSV file
        The lists are below for sample

        **Source List**
        FileId  FilePath      FileChecksum
        1       somepath A    check1
        2       somepath AA   check2
        3       somepath AAB  check3
        4       somepath B    check4
        5       somepath BB   check5

        **Destination List**

        StepId  StatusID  JobId ProjectId FileId     FilePath
        5        6         4    2091      577206853  somepath A
        5        6         4    2092      577206853  somepath AA
        5        6         4    2093      577206853  somepath AAA
        5        6         4    2094      577206853  somepath AB
        5        6         4    2095      577206853  somepath A
        5        6         4    2096      577206853  somepath B
        5        6         4    2097      577206853  somepath BB

        List<Source> SourceList = File.ReadAllLines(@"D:\source.csv").Skip(1).Select(v => Source.SourceFromCSv(v)).ToList();

        List<Destination> DestinationList = File.ReadAllLines(@"D:\Destination.csv").Skip(1).Select(d => Destination.FromDestinationCSV(d)).ToList();

        //This will compare and create a new list
        var result1 =
            from s in SourceList
            from d in DestinationList
            where (d.FilePath.ToLower() == s.FilePath.ToLower())
             select (d.StepId + "," + d.StatusId + "," + d.JobId + "," + 
             d.ProjectId + "," + d.FileId + "," + d.FilePath + "," + 
             s.FileChecksum);



             Expected Result:
             StepId StatusID  JobId ProjectId FileId    FilePath      FileChecksum
             5       6         4    2091      577206853 somepath A    check1
             5       6         4    2092      577206853 somepath AA   check2
             5       6         4    2095      577206853 somepath A    check1
             5       6         4    2096      577206853 somepath B    check4
             5       6         4    2097      577206853 somepath BB   check5

Upvotes: 0

Answers (5)

shiva1791

Reputation: 540

var query = from s in SourceList
 join d in DestinationList on 
 s.FilePath.ToLower().TrimEnd() equals d.FilePath.ToLower().TrimEnd()
 select (d.StepId + "," + d.StatusId + "," + d.JobId + "," +d.ProjectId + "," + d.FileId + "," + d.FilePath + "," + s.FileChecksum);

LINQ join did the same thing in less than 5 seconds.

Upvotes: 1

Mr.Malte

Reputation: 27

You can order both lists and then compare row by row. The algorithmic Complexity is O(n log n+n).

You compare first line of data A to first line of data B and then increase the index of the pointer on the "bigger" line. If data A has 8 and data B has 7 and 9, you will know the 8 is not present in data B once you reached 9.

You should start comparing at the max possible index. This way, you get quick termination, if the list is indeed a sublist.

Upvotes: 3

jessehouwing

Reputation: 114661

Yeah, if you don't need all the features of it being a list, making the base type a HashSet<T> will significantly improve lookups. Your custom type may need to implement a proper GetHashCode() function to improve lookup speeds even further.

See:

Don't call new HashSet(query.ToList()), instead, convert directly to a hashset while instantiating the list, query.ToHashSet(), optionally passing in an Equality Comparer, see below:

ToHashSet(IEqualityComparer<T>)

Instead of a custom GetHashCode implementation you can also implement a custom IEqualityComparer to handle specific cases such as yours where specific fields make up the rules for equality. Visual Studio and Resharper nowadays offer a built-in refactor to generate a good implementation of GetHashCode and Equals.

See:

IEqualityComparer<T>

You can then use IntersectWith to get all items in both sets in a single call:

See:

IntersectWith()

Creating a special object you can convert both Source and Destination to, or giving them the same base class will allow this.

You can also use a IDictionary<Key, Value> and make the key the Item.FilePath.ToLower(), the same principles as above apply. This will allow the runtime to check whether the item exists in the other list using the GetHashCode of the string, which is highly optimized by default.

Upvotes: 1

dr b

Reputation: 2249

All you're doing in principle is appending the file checksum onto the end of the destination list.

Create a hash or dictionary out of the source list, then your new list looks something like this.

//create dictionary SourceDictionary<string,string> with key = filepath.tolower and value = checksum
var newList = DestinationList.select(d => $"{d.thing1},{d.thingN}" + SourceDictionary[d.filename.tolower()])

should be much faster

Upvotes: 0

Daniel

Reputation: 11

You could do it the other way around. Instead of picking one of the 30k source entries, you could iterate over your 30 million entries. you can then stop if you found all 30k entries or in worst case, after 30 million entries. thats still better than 30K*15M.

Upvotes: 1

How do I compare and update 2 lists with huge number of items?

Answers (5)

Related Questions