Reputation: 5892

Detecting "near duplicates" using a LINQ/C# query

I'm using the following queries to detect duplicates in a database.

Using a LINQ join doesn't work very well because Company X may also be listed as CompanyX, therefore I'd like to amend this to detect "near duplicates".

var results = result
                .GroupBy(c => new {c.CompanyName})
                .Select(g => new CompanyGridViewModel
                    {
                        LeadId = g.First().LeadId,
                        Qty = g.Count(),
                        CompanyName = g.Key.CompanyName,
                    }).ToList();

Could anybody suggest a way in which I have better control over the comparison? Perhaps via an IEqualityComparer (although I'm not exactly sure how that would work in this situation)

My main goals are:

To list the first record with a subset of all duplicates (or "near duplicates")
To have some flexibility over the fields and text comparisons I use for my duplicates.

Upvotes: 3

Answers (3)

user1793607

Reputation: 531

In this case you need to define where the work is going to take place i.e. fully on the server, in local memory or a mixture of both.

In local memory: In this case we have two routes, to pull back all the data and just do the logic in local memory, or to stream the data and apply the logic piecewise. To pull all the data just ToList() or ToArray() the base table. To stream the data would suggest using ToLookup() with custom IEqualityComparer, e.g.

public class CustomEqualityComparer: IEqualityComparer<String>
{
    public bool Equals(String str1, String str2)
    {
        //custom logic
    }

    public int GetHashCode(String str)
    {
        // custom logic
    }
}

//result
var results = result.ToLookup(r => r.Name, 
                    new CustomEqualityComparer())
                    .Select(r => ....)

Fully on the server: Depends on your provider and what it can successfully map. E.g. if we define a near duplicate as one with an alternative delimiter one could do something like this:
```
private char[] delimiters = new char[]{' ','-','*'}

var results = result.GroupBy(r => delimiters.Aggregate( d => r.Replace(d,'')...
```
Mixture: In this case we are splitting the work between the two. Unless you come up with a nice scheme this route is most likely to be inefficient. E.g. if we keep the logic on the local side, build groupings as a mapping from a name into a key and just query the resulting groupings we can do something like this:
```
var groupings = result.Select(r => r.Name)
                      //pull into local memory
                      .ToArray()
                      //do local grouping logic...

                      //Query results
var results = result.GroupBy(r => groupings[r]).....
```

Personally I usually go with the first option, pulling all the data for small data sets and streaming large data sets (empirically I found streaming with logic between each pull takes a lot longer than pulling all the data then doing all the logic)

Notes: Dependent on the provider ToLookup() is usually immediate execution and in construction applies its logic piecewise.

Upvotes: 1

Rawling

Reputation: 50204

For your explicit "ignoring spaces" case, you can simply call

var results = result.GroupBy(c => c.Name.Replace(" ", ""))...

However, in the general case where you want flexibility, I'd build up a library of IEqualityComparer<Company> classes to use in your groupings. For example, this should do the same in your "ignore space" case:

public class CompanyNameIgnoringSpaces : IEqualityComparer<Company>
{
    public bool Equals(Company x, Company y)
    {
        return x.Name.Replace(" ", "") == y.Name.Replace(" ", "");
    }

    public int GetHashCode(Company obj)
    {
        return obj.Name.Replace(" ", "").GetHashCode();
    }
}

which you could use as

var results = result.GroupBy(c => c, new CompanyNameIgnoringSpaces())...

It's pretty straightforward to do similar things containing multiple fields, or other definitions of similarity, etc.

Just note that your defintion of "similar" must be transitive, e.g. if you're looking at integers you can't define "similar" as "within 5", because then you'd have "0 is similar to 5" and "5 is similar to 10" but not "0 is similar to 10". (It must also be reflexive and symmetric, but that's more straightforward.)

Upvotes: 1

Mike Perrenoud

Reputation: 67938

Okay, so since you're looking for different permutations you could do something like this:

Bear in mind this was written in the answer so it may not fully compile, but you get the idea.

var results = result
    .Where(g => CompanyNamePermutations(g.Key.CompanyName).Contains(g.Key.CompanyName))
    .GroupBy(c => new {c.CompanyName})
    .Select(g => new CompanyGridViewModel
        {
            LeadId = g.First().LeadId,
            Qty = g.Count(),
            CompanyName = g.Key.CompanyName,
        }).ToList();

private static List<string> CompanyNamePermutations(string companyName)
{
    // build your permutations here
    // so to build the one in your example
    return new List<string>
    {
        companyName,
        string.Join("", companyName.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
    };
}

Upvotes: 1

Detecting &quot;near duplicates&quot; using a LINQ/C# query

Answers (3)

Related Questions

Detecting "near duplicates" using a LINQ/C# query