Reputation: 10393
I know there are tons of similar questions on SO regarding this subject, but I couldn't quite find the answer I was looking for. Here's my requirement.
I have a long list of strings (easily upwards of 50,000 or even 100K items) in which I need to find the duplicate items. But just finding duplicates won't do; what I really want to do is go through the list and add an increment index at the end of each item to indicate the number of times an item repeats. To better illustrate let me take an example. My list actually contains paths, so the example roughly resembles that.
My original List:
AAA\BBB
AAA\CCC
AAA\CCC
BBB\XXX
BBB
BBB\XXX
BBB\XXX
My adjusted list with indices added:
AAA\BBB[1]
AAA\CCC[1]
AAA\CCC[2]
BBB\XXX[1]
BBB[1]
BBB\XXX[2]
BBB\XXX[3]
First I tried the following method using Linq:
List<string> originalList = new List<string>();
List<string> duplicateItems = new List<string>();
// pathList is a simple List<string> that contains my paths.
foreach (string item in pathList)
{
// Do some stuff here and pick 'item' only if it fits some criteria.
if (IsValid(item))
{
originalList.Add(item);
int occurences = originalList.Where(x => x.Equals(item)).Count();
duplicateItems.Add(item + "[" + occurences + "]");
}
}
This works just fine and gives me the desired result. The problem is it's painfully slow given that my list can contain 100K items. So I looked around and learned that HashSet could be a possible alternative that's potentially more efficient. But I can't quite figure out how I would get my exact desired result using that.
I could try something like this, I guess:
HashSet<string> originalList = new HashSet<string>();
List<string> duplicateItems = new List<string>();
foreach (string item in pathList)
{
// Do some stuff here and pick 'item' only if it fits some criteria.
if (IsValid(item))
{
if (!originalList.Add(item))
{
duplicateItems.Add(item + "[" + ??? + "]");
}
}
}
Later I could add "[1]" to all items in the HashSet, but how do I get the indices right (marked by the universal sign of confusion, ???, above) when adding an item to my duplicate list? I can't keep a reference int that I can pass to my method as there could be hundreds of different repeating items, each repeating different number of times as in my example.
Could I still use HashSet, or is there a better way of accomplishing my goal? Even a slight pointer in the right direction would be a great help.
Upvotes: 3
Views: 5962
Reputation: 1
Using a HashSet
Note: Dump() is a LinqPad method that prints the results to the screen — substitute as necessary.
void Main()
{
var list = new List<string> {"hello", "doctor", "name", "continue", "yesterday", "tomorrow", "HELLO"};
//case-insensitive string compare
list.HasDuplicates(StringComparer.OrdinalIgnoreCase).Dump();
//case-sensitive string compare
list.HasDuplicates().Dump();
//integer compare
var list2 = new List<int> { 1,2,3,4,5,2 };
list2.HasDuplicates().Dump();
}
public static class Test
{
public static bool HasDuplicates<T>(this IList<T> list, StringComparer stringComparer = null)
{
if (typeof(T) == typeof(string))
{
var hash = new HashSet<string>(list.Count, stringComparer);
foreach (var val in list) if (!hash.Add(val?.ToString())) break;
return hash.Count != list.Count;
}
else
{
var hash = new HashSet<T>(list.Count);
foreach (var val in list) if (!hash.Add(val)) break;
return hash.Count != list.Count;
}
}
}
/*
output:
True
False
True
*/
Upvotes: 0
Reputation: 10818
You could try this, although I have not performance tested it yet:
List<string> originalList = new List<string>()
{
@"AAA\BBB",
@"AAA\CCC",
@"AAA\CCC",
@"BBB\XXX",
@"BBB",
@"BBB\XXX",
@"BBB\XXX"
};
List<string> outputList = new List<string>();
foreach(var g in originalList.GroupBy(x => x).Select(x => x.ToList()))
{
var index = 1;
foreach(var item in g)
{
outputList.Add(string.Format("{0}[{1}]", item, index++));
}
}
Fiddle here
Upvotes: 4
Reputation: 863
You can use this crisp and crunchy code:
public static void Main()
{
var originalList = new List<string>()
{
@"AAA\BBB",
@"AAA\CCC",
@"AAA\CCC",
@"BBB\XXX",
@"BBB",
@"BBB\XXX",
@"BBB\XXX"
};
var outputList = originalList.GroupBy(x => x).SelectMany(x => x.Select((y, i) => string.Format("{0}[{1}]", y, i + 1)));
Console.WriteLine(string.Join("\n", outputList));
}
Upvotes: 0
Reputation: 22456
You could iterate over the list and use a dictionary to get the count, like this:
private int GetCount(IDictionary<string, int> counts, string item)
{
int count;
if (!counts.TryGetValue(item, out count))
count = 0;
count++;
counts[item] = count;
return count;
}
private IEnumerable<string> GetItems(IEnumerable<string> items)
{
// Initialize dict for counts with appropriate comparison
var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
foreach(var item in items)
yield return string.Format("{0}[{1}]", item, GetCount(counts, item));
}
Upvotes: 1
Reputation: 430
What about this?
static IEnumerable<string> MyCounter(IEnumerable<string> data)
{
var myDic = new Dictionary<string, int>();
foreach (var d in data)
{
if (!myDic.ContainsKey(d))
myDic[d] = 1;
else
myDic[d] = myDic[d] + 1 ;
yield return d +"[" + myDic[d] + "]";
}
}
Upvotes: 1
Reputation: 205629
Since you ask for fastest, the best IMO would be to use foreach
loop and counting Dictionary<string, int>
. It has the same time complexity as HashSet
and uses much less memory than LINQ GroupBy
:
var counts = new Dictionary<string, int>(pathList.Count); // specify max capacity to avoid rehashing
foreach (string item in pathList)
{
// Do some stuff here and pick 'item' only if it fits some criteria.
if (IsValid(item))
{
int count;
counts.TryGetValue(item, out count);
counts[item] = ++count;
duplicateItems.Add(item + "[" + count + "]");
}
}
Upvotes: 10
Reputation: 21147
You could just use Group() to pull the strings together and then project those groups using a combination of value and count.
Given your list of strings:
var listOfStrings;
var grouped = listOfStrings.GroupBy(x => x);
var groupedCount = grouped.Select(x => new {key = x.Key, count = group.Count()});
Upvotes: 0