10101

Reputation: 2412

Search for duplicates by word match

I have data in a list like this:

Microsoft Ltd
Microsoft
Google Inc
Amazon Ltd.
Amazon Ltd.
DropBox Corporation Ltd.
DropBox Corporation

My current solution detects only exactly matching duplicates, so it currently outputs:

Amazon Ltd.
Amazon Ltd.

I would like to extend it so that near-duplicates are included in the output list as well:

Microsoft Ltd
Microsoft
Amazon Ltd.
Amazon Ltd.
DropBox Corporation Ltd.
DropBox Corporation

Here is my current code:

var dups = companyList.AsEnumerable()
    .Where(g => !string.IsNullOrWhiteSpace(g.Name))
    .GroupBy(dr => dr.Name.Trim())
    .Where(gr => gr.Count() > 1)
    .SelectMany(g => g)
    .OrderBy(c => c.Name)
    .ToList();

I would be very thankful for any suggestion that leads to a solution for this kind of check. Personally I suspect there is no purely logical solution. Maybe only some kind of Levenshtein distance calculation, with detection based on a score? If a full solution is not possible, it would still be beneficial to catch at least the cases that match on multiple words (for example two):

DropBox Corporation Ltd.
DropBox Corporation

Upvotes: 1

Views: 103

Answers (2)

Kerrmiter

Reputation: 676

You can write your own equality comparer in which you define when two company names are treated as the same company. It needs to implement two methods:

  • GetHashCode(), which determines which companies are ever compared with each other: only objects with the same hash code are passed to Equals(). In your case I don't see a better choice than returning one hardcoded value for all of them, so that every company is compared with every other.
  • Equals(), which decides whether two companies are considered the same by checking their names. You can tweak it however you want until it works on your test set (I guess some experiments are going to be necessary).

Below is my implementation, which assumes that two companies are the same if their names differ by at most one word.

using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main()
    {
        var companyNames = new[]
        {
            "Microsoft Ltd",
            "Microsoft",
            "Google Inc",
            "Google Drive Inc",
            "Amazon Ltd.",
            "Amazon Ltd.",
            "DropBox Corporation Ltd.",
            "DropBox Corporation",
            "Corporation DropBox"
        };

        var companies = companyNames.Select(cn => new Company {Name = cn});

        var groups = companies
            .GroupBy(c => c, new CompanyComparer())
            .Where(gr => gr.Count() > 1);

        PrintResults(groups);

        Console.ReadKey();
    }



    private static void PrintResults(IEnumerable<IGrouping<Company, Company>> groups)
    {
        foreach (var grp in groups)
        {
            foreach (var c in grp)
            {
                Console.WriteLine(c.Name);
            }
            Console.WriteLine();
        }
    }
}

public class Company
{
    public string Name { get; set; }
}

public class CompanyComparer : IEqualityComparer<Company>
{
    public bool Equals(Company x, Company y)
    {
        if (x?.Name == null || y?.Name == null) return false;

        var xWords = GetWordsSet(x.Name);
        var yWords = GetWordsSet(y.Name);

        // make sure xWords refers to the name with more words
        if (xWords.Count < yWords.Count)
        {
            var temp = xWords;
            xWords = yWords;
            yWords = temp;
        }

        var commonWords = xWords.Count(xWord => yWords.Contains(xWord));

        return xWords.Count - commonWords <= 1;
    }

    public int GetHashCode(Company obj) => 0; // only companies with same hash code will be compared

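    // Split on whitespace, ignore empty entries caused by repeated spaces, compare case-insensitively
    private static ISet<string> GetWordsSet(string name) =>
        name.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(n => n.ToLowerInvariant())
            .ToHashSet();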
}

This gives the output:

Microsoft Ltd
Microsoft

Google Inc
Google Drive Inc

Amazon Ltd.
Amazon Ltd.

DropBox Corporation Ltd.
DropBox Corporation
Corporation DropBox
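
One caveat with the "at most one differing word" rule: it only counts words missing from the longer name, so two unrelated one-word names such as "Microsoft" and "Google" also compare as equal, and so do "Amazon Ltd" and "Microsoft Ltd". A stricter variant is sketched below. It is only an illustration under an assumed rule (every word of the shorter name must appear in the longer one, and the longer one may have at most one extra word); the class name `WordOverlap` and the method `IsSameCompany` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordOverlap
{
    // Splits on whitespace, drops empty entries, lower-cases each word.
    private static HashSet<string> Words(string name) =>
        new HashSet<string>(
            name.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                .Select(w => w.ToLowerInvariant()));

    // Assumed rule (not the rule used in the comparer above):
    // the shorter name's words are a subset of the longer name's words,
    // and the longer name has at most one additional word.
    public static bool IsSameCompany(string x, string y)
    {
        if (x == null || y == null) return false;

        var larger = Words(x);
        var smaller = Words(y);
        if (larger.Count < smaller.Count)
        {
            var temp = larger;
            larger = smaller;
            smaller = temp;
        }

        // A name with no words never matches anything.
        if (smaller.Count == 0) return false;

        return smaller.IsSubsetOf(larger) && larger.Count - smaller.Count <= 1;
    }
}
```

Neither rule is transitive ("Microsoft" matches both "Microsoft Ltd" and "Microsoft Inc", which do not match each other under the stricter rule), so when such a check is used inside an equality comparer for GroupBy, the resulting groups can depend on the order of the input.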

Upvotes: 1

Ian Mercer

Reputation: 39297

You can do a certain amount of 'canonicalization' by removing punctuation and words like "Inc" and "Corp" (see partial example below), and by removing parentheticals, but this is a very hard problem because of (i) abbreviations; (ii) location specifiers (East, North, ...); (iii) corporate taxonomy: is it a subsidiary, a branch, a franchisee, or a separate company?

In the end, a list of synonyms may be the best approach, together with some light canonicalization to remove common corporate entity type designators.

    private static string Clean(string corporation)
    {
        corporation = corporation.EndsWith("Inc") ? corporation.Substring(0, corporation.Length - 3) : corporation;
        return corporation
            .Replace(" LLC", "")
            .Replace(" S.A.", "")
            .Replace(" SA", "")
            .Replace(" S.L.", "")
            .Replace(" SL", "")
            .Replace("(1)", "")
            .Replace(" GmbH", "")
            .Replace("(UK) Ltd.", "")
            .Replace(" Limited", "")
            .Replace(" Corporation", "")
            .Replace(" Corp.", "")
            .Replace(" Corp ", " ")
            .Replace(" Ltd.", "")
            .Replace(" Ltd", "")
            .Replace(" Inc.", "")
            .Replace("(Pa)", "")
            .Replace(" Inc ", " ")
            .Replace(", LLP.", "")
            .Replace(" N.V.", "").Trim();
    }
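
To show how canonicalization could feed back into the duplicate check from the question, here is a minimal word-based sketch. It is an assumption-laden illustration rather than a replacement for the method above: the class `CompanyNames`, the methods `Canonicalize` and `FindDuplicates`, and the short designator list are all hypothetical and deliberately incomplete.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CompanyNames
{
    // Illustrative, incomplete list of entity-type designators (an assumption; extend as needed).
    private static readonly HashSet<string> Designators =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "inc", "ltd", "llc", "llp", "corp", "corporation", "limited", "gmbh"
        };

    // Lower-cases the name, trims surrounding punctuation from each word,
    // and drops words that are entity-type designators.
    public static string Canonicalize(string name)
    {
        var words = name
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.Trim('.', ',', '(', ')').ToLowerInvariant())
            .Where(w => w.Length > 0 && !Designators.Contains(w));

        return string.Join(" ", words);
    }

    // Groups names by canonical form and keeps only groups with more than one member.
    public static List<List<string>> FindDuplicates(IEnumerable<string> names) =>
        names
            .Where(n => !string.IsNullOrWhiteSpace(n))
            .GroupBy(n => Canonicalize(n))
            .Where(g => g.Key.Length > 0 && g.Count() > 1)
            .Select(g => g.ToList())
            .ToList();
}
```

Working on whole words avoids accidentally cutting a designator out of the middle of a longer word, at the cost of missing dotted forms such as "S.A." unless they are added to the list. On the seven names from the question this sketch yields three groups (the Microsoft, Amazon and DropBox pairs), while "Google Inc" stays out because nothing else shares its canonical form.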

Upvotes: 1
