Jon Cage
Jon Cage

Reputation: 37458

Simple histogram generation of integer data in C#

As part of a test bench I'm building, I'm looking for a simple class to calculate a histogram of integer values (number of iterations taken for an algorithm to solve a problem). The answer should be called something like this:

Histogram my_hist = new Histogram();

for( uint i = 0; i < NUMBER_OF_RESULTS; i++ )
{

    myHist.AddValue( some_result );
}

for( uint j = 0; j < myHist.NumOfBins; j++ )
{
     Console.WriteLine( "{0} occurred {1} times", myHist.BinValues[j], myHist.BinCounts[j] );
}

I was suprised a bit of googling didn't turn up a neat solution but maybe I didn't search for the right things. Is there a generic solution out there or is it worth rolling my own?

Upvotes: 10

Views: 37439

Answers (5)

Cristian Diaconescu
Cristian Diaconescu

Reputation: 35631

This builds on the accepted answer. The problem is that building a SortedDictionary iteratively is slooow, because both insertions and retrievals cost O(log(N)).

That can be avoided if you don't need to display the histogram as it is accumulating.

My modification uses a normal Dictionary and only sorts it at the end into a SortedList.

For a sample size of 10M items, this version is about 11x faster (on my machine), at the cost of slightly higher memory usage until the GC kicks in (~10% extra memory).

//generate a random sample
Random r = new Random();
var items = Enumerable
    .Range(1, 10_000_000)
    .Select( _ => (uint)r.Next(100_000))
    .ToList();

//build the histogram using a normal dictionary with O(1) lookups and insertions.
var tempHistogram = new Dictionary<uint, int>();
foreach (uint item in items)
{
    if (tempHistogram.ContainsKey(item))
    {
        tempHistogram[item]++;
    }
    else
    {
        tempHistogram[item] = 1;
    }
}

//Sort it once. SortedList conveniently has a ctor that takes a dictionary.
var sortedHistogram = new SortedList<uint, int>(tempHistogram);

foreach (KeyValuePair<uint, int> pair in sortedHistogram.Take(100))
{
    Console.WriteLine("{0} occurred {1} times", pair.Key, pair.Value);
}

For really large samples (larger than the available memory) there are amazing probabilistic algorithms that solve this problem.
They are also quite good for streaming data. Look for 'quantile sketches'. Here's an implementation from the Apache foundation: https://datasketches.apache.org/

Upvotes: 2

Mugen
Mugen

Reputation: 9085

My implementation of a simple extension method to create a histogram:

public static IReadOnlyDictionary<T, int> ToHistogram<T>(this IEnumerable<T> enumerable)
   => enumerable.GroupBy(item => item).ToDictionary(grouping => grouping.Key, grouping => grouping.Count());

Upvotes: 1

qxn
qxn

Reputation: 17574

You can use Linq:

var items = new[] {5, 6, 1, 2, 3, 1, 5, 2};
items
    .GroupBy(i => i)
    .Select(g => new {
        Item = g.Key,
        Count = g.Count()
    })
    .OrderBy(g => g.Item)
    .ToList()
    .ForEach(g => {
        Console.WriteLine("{0} occurred {1} times", g.Item, g.Count);
    });

Upvotes: 5

Jon Cage
Jon Cage

Reputation: 37458

Based on BastardSaint's suggestion I came up with a neat and fairly generic wrapper:

public class Histogram<TVal> : SortedDictionary<TVal, uint>
{
    public void IncrementCount(TVal binToIncrement)
    {
        if (ContainsKey(binToIncrement))
        {
            this[binToIncrement]++;
        }
        else
        {
            Add(binToIncrement, 1);
        }
    }
}

So now I can do:

const uint numOfInputDataPoints = 5;
Histogram<uint> hist = new Histogram<uint>();

// Fill the histogram with data
for (uint i = 0; i < numOfInputDataPoints; i++)
{
    // Grab a result from my algorithm
    uint numOfIterationsForSolution = MyAlorithm.Run();

    // Add the number to the histogram
    hist.IncrementCount( numOfIterationsForSolution );
}

// Report the results
foreach (KeyValuePair<uint, uint> histEntry in hist.AsEnumerable())
{
    Console.WriteLine("{0} occurred {1} times", histEntry.Key, histEntry.Value);
}

Took me a while to work out how to make it generic (to begin with I just overrode the SortedDictionary constructor which meant you could only use it for uint keys).

Upvotes: 6

Steef
Steef

Reputation: 34665

You could use SortedDictionary

uint[] items = new uint[] {5, 6, 1, 2, 3, 1, 5, 2}; // sample data
SortedDictionary<uint, int> histogram = new SortedDictionary<uint, int>();
foreach (uint item in items) {
    if (histogram.ContainsKey(item)) {
        histogram[item]++;
    } else {
        histogram[item] = 1;
    }
}
foreach (KeyValuePair<uint, int> pair in histogram) {
    Console.WriteLine("{0} occurred {1} times", pair.Key, pair.Value);
}

This will leave out empty bins, though

Upvotes: 22

Related Questions