Andriy Drozdyuk
Andriy Drozdyuk

Reputation: 61091

Fast collection comparison

I have the following data type:

ISet<IEnumerable<Foo>> 

So, I need to be able to create sets of sequences. E.g. this is ok:

ABC,AC,A

but this is not (since "AB" is repeated here"):

AB,A,ABC,BCA,AB

But, in order to do this - for "set" to not contain duplicates, I need to wrap my IEnumerable in some kind of other data type:

ISet<Seq>
//where
Seq : IEnumerable<Foo>, IEquatable<Seq>

Thus, I will be able to compare two sequences, and provide the Set data structure with a way of eliminating duplicates.

My question is: is there a fast data structure that allows for comparing sequences? I am thinking that somehow when Seq gets created, or added two, some kind of cumulative value is computed.

In other words, is it possible to implement Seq in such a way that I could do this:

var seq1 = new Seq( IList<Foo> );
var seq2 = new Seq( IList<Foo> )
seq1.equals(seq2) // O(1)

Thanks.

Upvotes: 1

Views: 186

Answers (2)

Alexei Levenkov
Alexei Levenkov

Reputation: 100547

O(1) essentially mean you are not allowed to compare values of elements. If you can represent sequence as list of immutable objects (with caching on add so there is no duplicates across all instances) you can achieve it as you'd only need to compare first element - similar how string interning works.

Insert will have to search for all instances of elements for "current"+"with this next" element. Some sort of dictionary may be reasonable approach...

EDIT: I think it simply tried to come up with suffix tree.

Upvotes: 1

Servy
Servy

Reputation: 203835

I have provided an implementation your sequence below. There are several points to note:

  1. This only works if the IEnumerable<T> returns the same items every time it is enumerated, and that those items are not mutated during the scope of this object.
  2. The hash code is cached. The first time it is requested it calculated it (feel free to improve the hash code algorithm if you know a better one) based on a full iteration of the underlying sequence. Because it only needs to be calculated once, this can be effectively considered O(1) if you compute it often. It's likely that adding to the set will be a bit slower (first time computation of the hash value) but searching or removing will be very quick.
  3. The equals method first compares the hash codes. If the hash codes are different then the objects cannot possibly be equal (if the hash codes were properly implemented on all objects in the sequence, and nothing was mutated). As long as you have a low rate of collision, and are usually comparing items that aren't actually equal, this means that equals checks will not often get past that hash code check. If they do, an iteration of the sequence is needed (there is no way around that). Because of that the equals is likely to average O(1), even though its worst case is still O(n).

    public class Foo : IEnumerable { private IEnumerable sequence;

    private int? myHashCode = null;
    
    public Foo(IEnumerable<T> sequence)
    {
        this.sequence = sequence;
    }
    
    public IEnumerator<T> GetEnumerator()
    {
        return sequence.GetEnumerator();
    }
    
    IEnumerator IEnumerable.GetEnumerator()
    {
        return sequence.GetEnumerator();
    }
    
    public override bool Equals(object obj)
    {
        Foo<T> other = obj as Foo<T>;
        if(other == null)
            return false;
    
        //if the hash codes are different we don't need to bother doing a deep equals check
        //the hash code is cached, so it's fast.
        if (GetHashCode() != obj.GetHashCode())
            return false;
    
        return Enumerable.SequenceEqual(sequence, other.sequence);
    }
    
    public override int GetHashCode()
    {
        //note that the hash code is cached, so the underlying sequence 
        //needs to not change.
        return myHashCode ?? populateHashCode();
    }
    
    private int populateHashCode()
    {
        int somePrimeNumber = 37;
        myHashCode = 1;
        foreach (T item in sequence)
        {
            myHashCode = (myHashCode * somePrimeNumber) + item.GetHashCode();
        }
    
        return myHashCode.Value;
    }
    

    }

Upvotes: 2

Related Questions