Reputation: 37458
As part of a test bench I'm building, I'm looking for a simple class to calculate a histogram of integer values (number of iterations taken for an algorithm to solve a problem). The answer should be called something like this:
Histogram my_hist = new Histogram();
for( uint i = 0; i < NUMBER_OF_RESULTS; i++ )
{
myHist.AddValue( some_result );
}
for( uint j = 0; j < myHist.NumOfBins; j++ )
{
Console.WriteLine( "{0} occurred {1} times", myHist.BinValues[j], myHist.BinCounts[j] );
}
I was suprised a bit of googling didn't turn up a neat solution but maybe I didn't search for the right things. Is there a generic solution out there or is it worth rolling my own?
Upvotes: 10
Views: 37439
Reputation: 35631
This builds on the accepted answer. The problem is that building a SortedDictionary
iteratively is slooow, because both insertions and retrievals cost O(log(N)).
That can be avoided if you don't need to display the histogram as it is accumulating.
My modification uses a normal Dictionary
and only sorts it at the end into a SortedList
.
For a sample size of 10M items, this version is about 11x faster (on my machine), at the cost of slightly higher memory usage until the GC kicks in (~10% extra memory).
//generate a random sample
Random r = new Random();
var items = Enumerable
.Range(1, 10_000_000)
.Select( _ => (uint)r.Next(100_000))
.ToList();
//build the histogram using a normal dictionary with O(1) lookups and insertions.
var tempHistogram = new Dictionary<uint, int>();
foreach (uint item in items)
{
if (tempHistogram.ContainsKey(item))
{
tempHistogram[item]++;
}
else
{
tempHistogram[item] = 1;
}
}
//Sort it once. SortedList conveniently has a ctor that takes a dictionary.
var sortedHistogram = new SortedList<uint, int>(tempHistogram);
foreach (KeyValuePair<uint, int> pair in sortedHistogram.Take(100))
{
Console.WriteLine("{0} occurred {1} times", pair.Key, pair.Value);
}
For really large samples (larger than the available memory) there are amazing probabilistic algorithms that solve this problem.
They are also quite good for streaming data.
Look for 'quantile sketches'. Here's an implementation from the Apache foundation: https://datasketches.apache.org/
Upvotes: 2
Reputation: 9085
My implementation of a simple extension method to create a histogram:
public static IReadOnlyDictionary<T, int> ToHistogram<T>(this IEnumerable<T> enumerable)
=> enumerable.GroupBy(item => item).ToDictionary(grouping => grouping.Key, grouping => grouping.Count());
Upvotes: 1
Reputation: 17574
You can use Linq:
var items = new[] {5, 6, 1, 2, 3, 1, 5, 2};
items
.GroupBy(i => i)
.Select(g => new {
Item = g.Key,
Count = g.Count()
})
.OrderBy(g => g.Item)
.ToList()
.ForEach(g => {
Console.WriteLine("{0} occurred {1} times", g.Item, g.Count);
});
Upvotes: 5
Reputation: 37458
Based on BastardSaint's suggestion I came up with a neat and fairly generic wrapper:
public class Histogram<TVal> : SortedDictionary<TVal, uint>
{
public void IncrementCount(TVal binToIncrement)
{
if (ContainsKey(binToIncrement))
{
this[binToIncrement]++;
}
else
{
Add(binToIncrement, 1);
}
}
}
So now I can do:
const uint numOfInputDataPoints = 5;
Histogram<uint> hist = new Histogram<uint>();
// Fill the histogram with data
for (uint i = 0; i < numOfInputDataPoints; i++)
{
// Grab a result from my algorithm
uint numOfIterationsForSolution = MyAlorithm.Run();
// Add the number to the histogram
hist.IncrementCount( numOfIterationsForSolution );
}
// Report the results
foreach (KeyValuePair<uint, uint> histEntry in hist.AsEnumerable())
{
Console.WriteLine("{0} occurred {1} times", histEntry.Key, histEntry.Value);
}
Took me a while to work out how to make it generic (to begin with I just overrode the SortedDictionary
constructor which meant you could only use it for uint
keys).
Upvotes: 6
Reputation: 34665
You could use SortedDictionary
uint[] items = new uint[] {5, 6, 1, 2, 3, 1, 5, 2}; // sample data
SortedDictionary<uint, int> histogram = new SortedDictionary<uint, int>();
foreach (uint item in items) {
if (histogram.ContainsKey(item)) {
histogram[item]++;
} else {
histogram[item] = 1;
}
}
foreach (KeyValuePair<uint, int> pair in histogram) {
Console.WriteLine("{0} occurred {1} times", pair.Key, pair.Value);
}
This will leave out empty bins, though
Upvotes: 22