Reputation: 2412
I have a list of company names like this:
Microsoft Ltd
Microsoft
Google Inc
Amazon Ltd.
Amazon Ltd.
DropBox Corporation Ltd.
DropBox Corporation
My current solution only detects exact duplicates, so it currently outputs:
Amazon Ltd.
Amazon Ltd.
I would like near-duplicates such as these to appear in the output as well:
Microsoft Ltd
Microsoft
Amazon Ltd.
Amazon Ltd.
DropBox Corporation Ltd.
DropBox Corporation
Here is my current code:
var dups = companyList.AsEnumerable()
.Where(g => !string.IsNullOrWhiteSpace(g.Name))
.GroupBy(dr => dr.Name.Trim())
.Where(gr => gr.Count() > 1)
.SelectMany(g => g)
.OrderBy(c => c.Name)
.ToList();
I would be very grateful for any suggestion on how to implement such a check. I suspect there is no purely logical solution; perhaps only some kind of Levenshtein distance calculation with a score threshold would work? If that is not feasible, it would still be useful to catch at least the names that share multiple words (for example two):
DropBox Corporation Ltd.
DropBox Corporation
Upvotes: 1
Views: 103
Reputation: 676
You can write your own equality comparer that defines when two company names are treated as the same company. It needs to implement two methods:
GetHashCode()
which determines which companies are ever compared with each other: only those with the same hash code are. In your case I don't see a better choice than hardcoding one value for all of them, so that every company is compared with every other.
Equals()
which decides whether two companies are considered the same by checking their names. You can tweak it however you like until it works on your test set (I guess some experimentation will be necessary).
Below is my implementation, which assumes that two companies are the same if their names differ by at most one word.
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main()
    {
        var companyNames = new[]
        {
            "Microsoft Ltd",
            "Microsoft",
            "Google Inc",
            "Google Drive Inc",
            "Amazon Ltd.",
            "Amazon Ltd.",
            "DropBox Corporation Ltd.",
            "DropBox Corporation",
            "Corporation DropBox"
        };
        var companies = companyNames.Select(cn => new Company {Name = cn});

        // Group with the custom comparer and keep only groups with more than one member.
        var groups = companies
            .GroupBy(c => c, new CompanyComparer())
            .Where(gr => gr.Count() > 1);

        PrintResults(groups);
        Console.ReadKey();
    }

    // Prints each group, followed by a blank line.
    private static void PrintResults(IEnumerable<IGrouping<Company, Company>> groups)
    {
        foreach (var grp in groups)
        {
            foreach (var c in grp)
            {
                Console.WriteLine(c.Name);
            }
            Console.WriteLine();
        }
    }
}

public class Company
{
    public string Name { get; set; }
}

public class CompanyComparer : IEqualityComparer<Company>
{
    public bool Equals(Company x, Company y)
    {
        if (x?.Name == null || y?.Name == null) return false;

        var xWords = GetWordsSet(x.Name);
        var yWords = GetWordsSet(y.Name);

        // make company with more words first
        if (xWords.Count < yWords.Count)
        {
            var temp = xWords;
            xWords = yWords;
            yWords = temp;
        }

        // Equal when the longer name has at most one word missing from the shorter one.
        // Note: two different single-word names also satisfy this condition.
        var commonWords = xWords.Count(xWord => yWords.Contains(xWord));
        return xWords.Count - commonWords <= 1;
    }

    public int GetHashCode(Company obj) => 0; // only companies with same hash code will be compared

    // Splits on whitespace and lower-cases each word.
    private static ISet<string> GetWordsSet(string name) =>
        name.Split().Select(n => n.ToLower()).ToHashSet();
}
Which gives the output:
Microsoft Ltd
Microsoft

Google Inc
Google Drive Inc

Amazon Ltd.
Amazon Ltd.

DropBox Corporation Ltd.
DropBox Corporation
Corporation DropBox
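One caveat with the "differ by at most one word" rule is that two unrelated single-word names (say "Microsoft" and "Google") also differ by one word, so they would be grouped together. A minimal sketch of a stricter check, assuming you also want at least one shared word, could look like this. `WordOverlap` and `IsMatch` are hypothetical names, not part of the code above; `Equals` in the comparer could delegate to it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (assumption): a stricter variant of the word-overlap rule.
public static class WordOverlap
{
    // True when the names share at least one word and the name with more
    // distinct words has at most one word that the other name lacks.
    public static bool IsMatch(string a, string b)
    {
        if (string.IsNullOrWhiteSpace(a) || string.IsNullOrWhiteSpace(b)) return false;

        var aWords = Words(a);
        var bWords = Words(b);

        // Put the name with more distinct words first.
        if (aWords.Count < bWords.Count) (aWords, bWords) = (bWords, aWords);

        var common = aWords.Count(w => bWords.Contains(w));
        return common >= 1 && aWords.Count - common <= 1;
    }

    // Splits on whitespace into a case-insensitive set of words.
    private static HashSet<string> Words(string name) =>
        new HashSet<string>(
            name.Split((char[])null, StringSplitOptions.RemoveEmptyEntries),
            StringComparer.OrdinalIgnoreCase);
}
```

Even with this check, names that share only a designator still match ("Microsoft Ltd" and "Amazon Ltd" have "Ltd" in common), so it probably makes sense to strip such words before comparing. Also keep in mind that this kind of equality is not transitive, so the groups you get from GroupBy can depend on the order of the input.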
Upvotes: 1
Reputation: 39297
You can do a certain amount of 'canonicalization' by removing punctuation, words like "Inc" and "Corp" (see partial example below), and parentheticals, but this is a very hard problem because of (i) abbreviations; (ii) location specifiers (East, North, ...); (iii) corporate taxonomy: is it a subsidiary, a branch, a franchisee, or a separate company?
Ultimately, a list of synonyms, together with some light canonicalization to remove common corporate entity type designators, may be the best approach.
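As a rough illustration of the synonym-list idea, here is a minimal sketch of a lookup table that maps known aliases to one canonical name. The class name `CompanySynonyms`, the method `Resolve`, and the table entries are all assumptions made up for the example; in practice the list would have to be maintained by hand or loaded from your own data.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical synonym table (assumption): entries are illustrative only.
public static class CompanySynonyms
{
    // Case-insensitive map from a known alias to the canonical name.
    private static readonly Dictionary<string, string> Canonical =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["MSFT"] = "Microsoft",
            ["Microsoft Corp"] = "Microsoft",
            ["Big Blue"] = "IBM",
        };

    // Returns the canonical name for a known alias, or the trimmed input otherwise.
    public static string Resolve(string name)
    {
        var key = name.Trim();
        return Canonical.TryGetValue(key, out var canonical) ? canonical : key;
    }
}
```

Names that are not in the table pass through unchanged, so this can be applied before or after any other cleaning step.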
// Partial example: strips a few common entity designators from a company name.
// These are plain substring replacements, so they can also hit the middle of a name.
private static string Clean(string corporation)
{
    // Remove a trailing " Inc" (checked with the leading space so that names like "Zinc" are left alone).
    corporation = corporation.EndsWith(" Inc") ? corporation.Substring(0, corporation.Length - 4) : corporation;
    return corporation
        .Replace(" LLC", "")
        .Replace(" S.A.", "")
        .Replace(" SA", "")
        .Replace(" S.L.", "")
        .Replace(" SL", "")
        .Replace("(1)", "")
        .Replace(" GmbH", "")
        .Replace("(UK) Ltd.", "")
        .Replace(" Limited", "")
        .Replace(" Corporation", "")
        .Replace(" Corp.", "")
        .Replace(" Corp ", " ")
        .Replace(" Ltd.", "")
        .Replace(" Ltd", "")
        .Replace(" Inc.", "")
        .Replace("(Pa)", "")
        .Replace(" Inc ", " ")
        .Replace(", LLP.", "")
        .Replace(" N.V.", "").Trim();
}
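To connect this back to the question's query, one option is to group by a cleaned key instead of by the raw name. The sketch below is self-contained and therefore uses its own simplified, word-based cleaning rather than the Clean method above. `DuplicateFinder`, `Key`, `FindDuplicates`, and the short designator list are hypothetical and certainly incomplete.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch (assumption): group names by a canonical key.
public static class DuplicateFinder
{
    // Illustrative, incomplete list of entity designators.
    private static readonly HashSet<string> Designators =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "Ltd", "Inc", "Corp", "Corporation", "LLC", "GmbH", "Limited" };

    // Builds the grouping key: lower-cased name without trailing designator words.
    public static string Key(string name)
    {
        var words = name
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.Trim('.', ','))
            .Where(w => w.Length > 0)
            .ToList();

        // Drop designators from the end, but always keep at least one word.
        while (words.Count > 1 && Designators.Contains(words[words.Count - 1]))
            words.RemoveAt(words.Count - 1);

        return string.Join(" ", words).ToLowerInvariant();
    }

    // Returns the groups of names that share a key, in order of first appearance.
    public static List<List<string>> FindDuplicates(IEnumerable<string> names) =>
        names
            .Where(n => !string.IsNullOrWhiteSpace(n))
            .GroupBy(n => Key(n))
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList())
            .ToList();
}
```

With the seven names from the question, this should produce three groups (Microsoft, Amazon, DropBox) and leave "Google Inc" out, since nothing else shares its key. Because grouping is by an exact key, the result does not depend on input order, but it also will not catch abbreviations or reordered words.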
Upvotes: 1