Tommi
Tommi

Reputation: 3247

.net Distinct() and complex conditons

Assume I have a class

public class Audio
{
    public string artist   { get; set; }
    public string title    { get; set; }
    // etc.
}

Now I want to filter duplicates in list of such audio's by similarity (not EXACT match) condition. Basicaly it's Levenstein distance with treshold correction by string total length. The problem is, general tip about IEqualityComparer is "Always implement both GetHashCode and Compare". I obviuosly can't calc distance between strings in GetHashCode because it's not a compare method at all. However in this case even similar audio's will return different hashes and Distinct() will treat it as different objects and compare() method does not fired.

I tried to force GetHashCode always returns 0, so Compare called for each-to-each object in collection, but this is slow. So, finally, a question: is there anything I can do with .net out of the box or should I search some good algorithm for filtering?

Upvotes: 5

Views: 305

Answers (2)

rkrahl
rkrahl

Reputation: 1177

Code for BKTree, with simple "c# interoperability" layer and example in c# is here:

https://bitbucket.org/ptasz3k/bktree

It's VS 2012 solution.

You start with building tree from all of your objects, passing selector function (x => x.Key.ToLowerInvariant() in example), then you search for a given key and levenshtein distance and tree returns all matching objects.

So, if i understand your problem correctly:

var bk = BKTree.CSharp.CreateBK(x => x.artist, audios);
var allArtists = audios.Select(x => x.artist);
var possibleDuplicates = allArtists.Select(x => new 
    { Key = x, Similiar = BKTree.CSharp.FindInBk(bk, x, treshold).ToList());

Hope this helps.

Upvotes: 1

Eduard Dumitru
Eduard Dumitru

Reputation: 3262

I would suggest (first of all) not using Distinct or GetHashCode.

GetHashCode is too strict for your case (as @Gabe pointed out perfectly). What you could do is:

  1. Admit that you will have to compare an entire triangle (O(n^2) complexity) of instances pairs using the Levenshtein
  2. Try to optimise that using every trick in the book: How does calculating the Levenshtein distance from the empty string to the current one sound (that is for each and every instance of Audio and probably for separately for both string properties) ?

That could end up (one might say) with a darn good GetHashCode. But you can't use it like a GetHashCode, you should rather use it like so:

bool AreSimilar(Audio me, Audio you) {
  int cheapLevenshtein = Math.Abs(me.AbsoluteQuasiLevenshtein - you.AbsoluteQuasiLevenshtein);

  if (cheapLevenshtein < THRESHOLD) {

    int expensiveLevenshtein = Audio.LevenshteinBetween(me, you);
    var result = (expensiveLevenshtein < LIMIT);
    return result;

  } else
    return false;
}

And then you'd end up with a better or worse algorithm. This was just an idea and, of course: you can't use Distinct(). If you wish, you can write you very own extension method to make the whole thing look nice from a user programmer's perspective.

And yes the AbsoluteQuasiLevenshtein would be equal for things like: "ab" and "zy" but it would differ greatly between "ab" and "blahblahblahblah" and at least you would optimize things a bit. (The GetHashCode + Distinct approach posed an additional problem - the strictness of GetHashCode).

Upvotes: 3

Related Questions