Reputation: 5892
I'm using the following queries to detect duplicates in a database.
Using a LINQ join doesn't work very well because Company X may also be listed as CompanyX, therefore I'd like to amend this to detect "near duplicates".
var results = result
.GroupBy(c => new {c.CompanyName})
.Select(g => new CompanyGridViewModel
{
LeadId = g.First().LeadId,
Qty = g.Count(),
CompanyName = g.Key.CompanyName,
}).ToList();
Could anybody suggest a way in which I have better control over the comparison? Perhaps via an IEqualityComparer (although I'm not exactly sure how that would work in this situation)
My main goals are:
Upvotes: 3
Views: 419
Reputation: 531
In this case you need to define where the work is going to take place i.e. fully on the server, in local memory or a mixture of both.
In local memory:
In this case we have two routes, to pull back all the data and just do the logic in local memory, or to stream the data and apply the logic piecewise. To pull all the data just ToList()
or ToArray()
the base table. To stream the data would suggest using ToLookup()
with custom IEqualityComparer, e.g.
public class CustomEqualityComparer: IEqualityComparer<String>
{
public bool Equals(String str1, String str2)
{
//custom logic
}
public int GetHashCode(String str)
{
// custom logic
}
}
//result
var results = result.ToLookup(r => r.Name,
new CustomEqualityComparer())
.Select(r => ....)
Fully on the server: Depends on your provider and what it can successfully map. E.g. if we define a near duplicate as one with an alternative delimiter one could do something like this:
private char[] delimiters = new char[]{' ','-','*'}
var results = result.GroupBy(r => delimiters.Aggregate( d => r.Replace(d,'')...
Mixture: In this case we are splitting the work between the two. Unless you come up with a nice scheme this route is most likely to be inefficient. E.g. if we keep the logic on the local side, build groupings as a mapping from a name into a key and just query the resulting groupings we can do something like this:
var groupings = result.Select(r => r.Name)
//pull into local memory
.ToArray()
//do local grouping logic...
//Query results
var results = result.GroupBy(r => groupings[r]).....
Personally I usually go with the first option, pulling all the data for small data sets and streaming large data sets (empirically I found streaming with logic between each pull takes a lot longer than pulling all the data then doing all the logic)
Notes: Dependent on the provider ToLookup()
is usually immediate execution and in construction applies its logic piecewise.
Upvotes: 1
Reputation: 50204
For your explicit "ignoring spaces" case, you can simply call
var results = result.GroupBy(c => c.Name.Replace(" ", ""))...
However, in the general case where you want flexibility, I'd build up a library of IEqualityComparer<Company>
classes to use in your groupings. For example, this should do the same in your "ignore space" case:
public class CompanyNameIgnoringSpaces : IEqualityComparer<Company>
{
public bool Equals(Company x, Company y)
{
return x.Name.Replace(" ", "") == y.Name.Replace(" ", "");
}
public int GetHashCode(Company obj)
{
return obj.Name.Replace(" ", "").GetHashCode();
}
}
which you could use as
var results = result.GroupBy(c => c, new CompanyNameIgnoringSpaces())...
It's pretty straightforward to do similar things containing multiple fields, or other definitions of similarity, etc.
Just note that your defintion of "similar" must be transitive, e.g. if you're looking at integers you can't define "similar" as "within 5", because then you'd have "0 is similar to 5" and "5 is similar to 10" but not "0 is similar to 10". (It must also be reflexive and symmetric, but that's more straightforward.)
Upvotes: 1
Reputation: 67938
Okay, so since you're looking for different permutations you could do something like this:
Bear in mind this was written in the answer so it may not fully compile, but you get the idea.
var results = result
.Where(g => CompanyNamePermutations(g.Key.CompanyName).Contains(g.Key.CompanyName))
.GroupBy(c => new {c.CompanyName})
.Select(g => new CompanyGridViewModel
{
LeadId = g.First().LeadId,
Qty = g.Count(),
CompanyName = g.Key.CompanyName,
}).ToList();
private static List<string> CompanyNamePermutations(string companyName)
{
// build your permutations here
// so to build the one in your example
return new List<string>
{
companyName,
string.Join("", companyName.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
};
}
Upvotes: 1