Non Sense
Non Sense

Reputation: 11

Handling millions of records in a collection

I'm currently facing the issue of having large xmls with millions of datasets. After reading and deserializing them (which does not take too much time), the data needs to be written to the database (which also does not take much time) after some data adaptions and calculations. There is one 'main table', containing the data that the other two tables are referencing to. This means I will insert distinct (no duplicates) data into that main table.

Now, the other two tables are filled based upon the same data, but there can be several entries referencing the same entry in the main table. Now my issue is: I have another list containing the records that are supposed to be inserted into my 2 tables. As soon as I encounter a duplicate in my record iteration, I need to find the existing match to the duplicate in the list. A duplicate in this case does exclude 1-2 properties which can be different, meaning I do have to adapt functions like Equals.

I have tried using a HashSet, but finding the item in a HashSet with millions of entries seems to run for too long. I have tried a BinarySearch on a List, but sorting takes forever and I am struggling with the implementation. My objects do not have a key to really identify them with, 8/10 properties together represent the primary key. I was thinking about splitting the files, maybe even let it run async - but I fear that data will become corrupt.

What is the most efficient way to go about this?

Upvotes: 0

Views: 272

Answers (2)

mickeldinesh
mickeldinesh

Reputation: 1

public class DbEntity : IEquatable<DbEntity>
{
    public DbEntity(int prop1, Guid prop2, string prop3, string prop4, string prop5)
    {
        Prop1 = prop1;
        Prop2 = prop2;
        Prop3 = prop3;
        Prop4 = prop4;
        Prop5 = prop5;
    }

    public int Prop1 { get; }
    public Guid Prop2 { get; }
    public string Prop3 { get; }
    public string Prop4 { get; }
    public string Prop5 { get; set; }

    public bool Equals(DbEntity? other) => 
        other is not null &&
        (ReferenceEquals(this, other) || 
        (Prop1 == other.Prop1 && Prop2 == other.Prop2 && Prop3 == other.Prop3 && Prop4 == other.Prop4));

    public override bool Equals(object? obj) => Equals(obj as DbEntity);

    public override int GetHashCode() => HashCode.Combine(Prop1, Prop2, Prop3, Prop4);
}

Upvotes: 0

Tim Schmelter
Tim Schmelter

Reputation: 460238

You should use an efficient implementation of GetHashCode and Equals:

public class DbEntity: IEquatable<DbEntity>
{
    public DbEntity(int prop1, Guid prop2, string prop3, string prop4, string prop5)
    {
        Prop1 = prop1;
        Prop2 = prop2;
        Prop3 = prop3;
        Prop4 = prop4;
        Prop5 = prop5;
    }

    public int Prop1 { get; }
    public Guid Prop2 { get; }
    public string Prop3 { get; }
    public string Prop4 { get; }
    public string Prop5 { get; set; }

    public bool Equals(DbEntity? other)
    {
        if (ReferenceEquals(null, other)) return false;
        if (ReferenceEquals(this, other)) return true;
        return Prop1 == other.Prop1 && Prop2.Equals(other.Prop2) && Prop3 == other.Prop3 && Prop4 == other.Prop4;
    }

    public override bool Equals(object? obj)
    {
        if (ReferenceEquals(null, obj)) return false;
        if (ReferenceEquals(this, obj)) return true;
        if (obj.GetType() != this.GetType()) return false;
        return Equals((DbEntity)obj);
    }

    public override int GetHashCode()
    {
        return HashCode.Combine(Prop1, Prop2, Prop3, Prop4);
    }
}

Then you can use a HashSet<T>:

HashSet<DbEntity> set = new(); // ...

Upvotes: 1

Related Questions