Reputation: 3247
Assume I have a class
public class Audio
{
public string artist { get; set; }
public string title { get; set; }
// etc.
}
Now I want to filter duplicates in list of such audio's by similarity (not EXACT match) condition. Basicaly it's Levenstein distance with treshold correction by string total length. The problem is, general tip about IEqualityComparer is "Always implement both GetHashCode and Compare". I obviuosly can't calc distance between strings in GetHashCode because it's not a compare method at all. However in this case even similar audio's will return different hashes and Distinct() will treat it as different objects and compare() method does not fired.
I tried to force GetHashCode always returns 0, so Compare called for each-to-each object in collection, but this is slow. So, finally, a question: is there anything I can do with .net out of the box or should I search some good algorithm for filtering?
Upvotes: 5
Views: 305
Reputation: 1177
Code for BKTree, with simple "c# interoperability" layer and example in c# is here:
https://bitbucket.org/ptasz3k/bktree
It's VS 2012 solution.
You start with building tree from all of your objects, passing selector function (x => x.Key.ToLowerInvariant() in example), then you search for a given key and levenshtein distance and tree returns all matching objects.
So, if i understand your problem correctly:
var bk = BKTree.CSharp.CreateBK(x => x.artist, audios);
var allArtists = audios.Select(x => x.artist);
var possibleDuplicates = allArtists.Select(x => new
{ Key = x, Similiar = BKTree.CSharp.FindInBk(bk, x, treshold).ToList());
Hope this helps.
Upvotes: 1
Reputation: 3262
I would suggest (first of all) not using Distinct or GetHashCode.
GetHashCode is too strict for your case (as @Gabe pointed out perfectly). What you could do is:
That could end up (one might say) with a darn good GetHashCode. But you can't use it like a GetHashCode, you should rather use it like so:
bool AreSimilar(Audio me, Audio you) {
int cheapLevenshtein = Math.Abs(me.AbsoluteQuasiLevenshtein - you.AbsoluteQuasiLevenshtein);
if (cheapLevenshtein < THRESHOLD) {
int expensiveLevenshtein = Audio.LevenshteinBetween(me, you);
var result = (expensiveLevenshtein < LIMIT);
return result;
} else
return false;
}
And then you'd end up with a better or worse algorithm. This was just an idea and, of course: you can't use Distinct(). If you wish, you can write you very own extension method to make the whole thing look nice from a user programmer's perspective.
And yes the AbsoluteQuasiLevenshtein would be equal for things like: "ab" and "zy" but it would differ greatly between "ab" and "blahblahblahblah" and at least you would optimize things a bit. (The GetHashCode + Distinct approach posed an additional problem - the strictness of GetHashCode).
Upvotes: 3