Reputation: 614
I'm looking to produce a data histogram from a given dataset. I've read about different options for constructing the histogram and I'm most interested in a method based on the work of
Shimazaki, H.; Shinomoto, S. (2007). "A method for selecting the bin size of a time histogram"
The above method uses estimation to determine the optimal bin width and distribution, which is needed in my case because the sample data will vary in distribution and hard to determine the bin count and width in advance.
Can someone recommend a good source or a starting point for writing such a function in c# or have a close enough c# histogram code.
Many thanks.
Upvotes: 3
Views: 2865
Reputation: 4392
A small update to nick_w answer.
If you actually need the bins after. Plus optimized the double loop in histogram function away, plus got rid of linspace function.
/// <summary>
/// Calculate the optimal bins for the given data
/// </summary>
/// <param name="x">The data you have</param>
/// <param name="xMin">The minimum element</param>
/// <param name="optimalBinWidth">The width between each bin</param>
/// <returns>The bins</returns>
public static int[] CalculateOptimalBinWidth(List<double> x, out double xMin, out double optimalBinWidth)
{
var xMax = x.Max();
xMin = x.Min();
optimalBinWidth = 0;
const int MIN_BINS = 1;
const int MAX_BINS = 20;
int[] minKi = null;
var minOffset = double.MaxValue;
foreach (var n in Enumerable.Range(MIN_BINS, MAX_BINS - MIN_BINS).Select(v => v*5))
{
var d = (xMax - xMin)/n;
var ki = Histogram(x, n, xMin, d);
var ki2 = ki.Skip(1).Take(ki.Length - 2).ToArray();
var mean = ki2.Average();
var variance = ki2.Select(v => Math.Pow(v - mean, 2)).Sum()/n;
var offset = (2*mean - variance)/Math.Pow(d, 2);
if (offset < minOffset)
{
minKi = ki;
minOffset = offset;
optimalBinWidth = d;
}
}
return minKi;
}
private static int[] Histogram(List<double> data, int count, double xMin, double d)
{
var histogram = new int[count];
foreach (var t in data)
{
var bucket = (int) Math.Truncate((t - xMin)/d);
if (count == bucket) //fix xMax
bucket --;
histogram[bucket]++;
}
return histogram;
}
Upvotes: 1
Reputation: 83
I would recommend binary search to speed up the assignment to the class intervals.
public void Add(double element)
{
if (element < Bins.First().LeftBound || element > Bins.Last().RightBound)
return;
var min = 0;
var max = Bins.Length - 1;
var index = 0;
while (min <= max)
{
index = min + ((max - min) / 2);
if (element >= Bins[index].LeftBound && element < Bins[index].RightBound)
break;
if (element < Bins[index].LeftBound)
max = index - 1;
else
min = index + 1;
}
Bins[index].Count++;
}
"Bins" is a list of items of type "HistogramItem" which defines properties like "Leftbound", "RightBound" and "Count".
Upvotes: 0
Reputation: 1083
Check the Histogram function. If any data elements are unlucky to be equal to a bin boundary (other than the first or last bin), they will be counted in both consecutive bins. The code needs to check (lower <= data[j] && data[j] < upper) and handle the case that all elements equal to xMax go into the last bin.
Upvotes: 1
Reputation: 14938
The following is a port I wrote of the Python version of this algorithm from here. I know the API could do with some work, but this should be enough to get you started. The results of this code are identical to those produced by the Python code for the same input data.
public class HistSample
{
public static void CalculateOptimalBinWidth(double[] x)
{
double xMax = x.Max(), xMin = x.Min();
int minBins = 4, maxBins = 50;
double[] N = Enumerable.Range(minBins, maxBins - minBins)
.Select(v => (double)v).ToArray();
double[] D = N.Select(v => (xMax - xMin) / v).ToArray();
double[] C = new double[D.Length];
for (int i = 0; i < N.Length; i++)
{
double[] binIntervals = LinearSpace(xMin, xMax, (int)N[i] + 1);
double[] ki = Histogram(x, binIntervals);
ki = ki.Skip(1).Take(ki.Length - 2).ToArray();
double mean = ki.Average();
double variance = ki.Select(v => Math.Pow(v - mean, 2)).Sum() / N[i];
C[i] = (2 * mean - variance) / (Math.Pow(D[i], 2));
}
double minC = C.Min();
int index = C.Select((c, ix) => new { Value = c, Index = ix })
.Where(c => c.Value == minC).First().Index;
double optimalBinWidth = D[index];
}
public static double[] Histogram(double[] data, double[] binEdges)
{
double[] counts = new double[binEdges.Length - 1];
for (int i = 0; i < binEdges.Length - 1; i++)
{
double lower = binEdges[i], upper = binEdges[i + 1];
for (int j = 0; j < data.Length; j++)
{
if (data[j] >= lower && data[j] <= upper)
{
counts[i]++;
}
}
}
return counts;
}
public static double[] LinearSpace(double a, double b, int count)
{
double[] output = new double[count];
for (int i = 0; i < count; i++)
{
output[i] = a + ((i * (b - a)) / (count - 1));
}
return output;
}
}
Run it like this:
double[] x =
{
4.37, 3.87, 4.00, 4.03, 3.50, 4.08, 2.25, 4.70, 1.73,
4.93, 1.73, 4.62, 3.43, 4.25, 1.68, 3.92, 3.68, 3.10,
4.03, 1.77, 4.08, 1.75, 3.20, 1.85, 4.62, 1.97, 4.50,
3.92, 4.35, 2.33, 3.83, 1.88, 4.60, 1.80, 4.73, 1.77,
4.57, 1.85, 3.52, 4.00, 3.70, 3.72, 4.25, 3.58, 3.80,
3.77, 3.75, 2.50, 4.50, 4.10, 3.70, 3.80, 3.43, 4.00,
2.27, 4.40, 4.05, 4.25, 3.33, 2.00, 4.33, 2.93, 4.58,
1.90, 3.58, 3.73, 3.73, 1.82, 4.63, 3.50, 4.00, 3.67,
1.67, 4.60, 1.67, 4.00, 1.80, 4.42, 1.90, 4.63, 2.93,
3.50, 1.97, 4.28, 1.83, 4.13, 1.83, 4.65, 4.20, 3.93,
4.33, 1.83, 4.53, 2.03, 4.18, 4.43, 4.07, 4.13, 3.95,
4.10, 2.27, 4.58, 1.90, 4.50, 1.95, 4.83, 4.12
};
HistSample.CalculateOptimalBinWidth(x);
Upvotes: 8