Reputation: 4553
Trying to remove duplicates from a list where a duplicate is when either the first, the second or both properties are equal (appear more than once in the list). Using MoreLINQ, the code below is working:
var list = new List<LinqTest> // LinqTest: object containing 2 strings
{
// are ok
new LinqTest { Str1 = "a1", Str2 = "b1"},
new LinqTest { Str1 = "a2", Str2 = "b2"},
new LinqTest { Str1 = "a3", Str2 = "b3"},
new LinqTest { Str1 = "a5", Str2 = "b5"},
new LinqTest { Str1 = "a6", Str2 = "b6"},
new LinqTest { Str1 = "x1", Str2 = "y1"},
new LinqTest { Str1 = "y1", Str2 = "x1"},
// must be removed
new LinqTest { Str1 = "d1", Str2 = "b4"},
new LinqTest { Str1 = "d1", Str2 = "d2"},
new LinqTest { Str1 = "d1", Str2 = "d2"},
new LinqTest { Str1 = "a4", Str2 = "d2"},
new LinqTest { Str1 = "d3", Str2 = "b7"},
new LinqTest { Str1 = "d3", Str2 = "b8"},
new LinqTest { Str1 = "d3", Str2 = "b8"},
};
var duplicatesStr1 = list
.GroupBy(x => x.Str1)
.Where(x => x.Count() > 1)
.SelectMany(x => x)
.ToList();
var duplicatesStr2 = list
.GroupBy(x => x.Str2)
.Where(x => x.Count() > 1)
.SelectMany(x => x)
.ToList(); ;
var res = list
.ExceptBy(duplicatesStr1, x => x.Str1)
.ExceptBy(duplicatesStr2, x => x.Str2);
var rem = duplicatesStr1
.Union(duplicatesStr2)
.DistinctBy(x => new { x.Str1, x.Str2})
.ToList();
Console.WriteLine("----------");
foreach (var linqTest in res)
{
Console.WriteLine("keep> " + linqTest.Str1 + "-" + linqTest.Str2);
}
Console.WriteLine("----------");
foreach (var linqTest in rem)
{
Console.WriteLine("remove> " + linqTest.Str1 + "-" + linqTest.Str2);
}
Question: Is there a more efficient and/or shorter way to achieve this?
Upvotes: 0
Views: 1916
Reputation: 11
Here is another approach using standard LINQ ToDictionary() method to remove items that are duplicated based on property comparison. Create a dictionary that accumulates the count of matches throughout the list, and filter the resulting dictionary for those with a single match.
list = list.ToDictionary(
test => test,
test => list.Count(item => item.Str1 == test.Str1 || item.Str2 == test.Str2)
).Where(pair => pair.Value == 1)
.Select(pair => pair.Key)
.ToList();
Even better would be to create an extension method in a separate class and push the unique property comparison logic to there.
internal static class EnumerableExtensions
{
public static IEnumerable<T> Unique<T>(
this IEnumerable<T> sequence,
Func<T, T, bool> match)
{
var list = sequence.ToList();
return list
.ToDictionary(arg => arg, arg => list.Count(item => match(item, arg)))
.Where(pair => pair.Value == 1)
.Select(pair => pair.Key);
}
}
Exposed as an extension method would result in a method signature of:
list = list.Unique((a, b) => a.Str1 == b.Str1 || a.Str2 == b.Str2).ToList();
Upvotes: 1
Reputation: 32296
You could do two passes by first getting all the values that exist for the first and second properties more than once, and then filter on that. You'd need 2 hash sets for each property. The first one is to keep track if a value has been seen at least once. And if it has been seen at least once it's then added to the second hash set. Thus resulting in one hash set for each property that only contains the values that are duplicated. Then just filter out any items that are in either of those hash sets.
HashSet<string> hash1Once = new HashSet<string>();
HashSet<string> hash1More = new HashSet<string>();
HashSet<string> hash2Once = new HashSet<string>();
HashSet<string> hash2More = new HashSet<string>();
foreach(var item in list){
if(!hash1Once.Add(item.Str1))
hash1More.Add(item.Str1);
if(!hash2Once.Add(item.Str2))
hash2More.Add(item.Str2);
}
var unique = list.Where(x => !hash1More.Contains(x.Str1) && !hash2More.Contains(x.Str2))
.ToList();
Upvotes: 0
Reputation: 17258
You can use the .Distinct
LINQ method to do this. You'll have to define a custom IEqualityComparer
to decide when two elements are considered distinct.
public class MyComparer : IEqualityComparer<LinqTest>
{
public bool Equals(LinqTest x, LinqTest y)
{
return x.Str1 == y.Str1 || x.Str2 == y.Str2;
}
public int GetHashCode(LinqTest obj)
{
return 0;
}
}
Then you can write:
List<LinqTest> noDuplicates = originalList.Distinct(new MyComparer()).ToList();
The tricky part is part is implementing the IEqualityComparer
properly (and I didn't the first time around!). GetHashCode()
must return the same value for two objects that are considered equal. Since our notion of equality is not transitive, the only way to satisfy this requirement is to return a constant value. This is allowed, but defeats the goal of the hash codes, which are a way to speed up equality checking: if the hash codes differ, the objects must differ without needing potentially more costly "deep" comparison.
So, this code works, but does not achieve the highest possible performance.
Upvotes: 5