jerlich

Reputation: 362

Get the mapping from each element of input to the bin of the histogram in Julia

Matlab's [n, mapx] = histc(x, bin_edges) returns the counts of x in each bin as n, along with a map mapx of the same length as x that gives the index of the bin each element of x was placed into.

I can do the same thing in Julia as follows:

using StatsBase
x = rand(1000)
bin_e = 0:0.1:1
h = fit(Histogram, x, bin_e)
# For each z, find the first edge >= z; the bin index is one less than that
yx = map(z -> findnext(z .<= h.edges[1], 1), x) .- 1

Is this the "right way" to do this? It seems a bit kludgy.

Upvotes: 3

Views: 968

Answers (3)

glwhart

Reputation: 334

I stumbled across this question when I was trying to figure out how many occurrences of each value I had in a list of values. If each value is in its own bin (as for categorical data, or integer data with a small number of unique values), this is what one would be plotting in a histogram.

If that is what you want, then countmap() in the StatsBase package is just what you need.
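As a minimal sketch of that idea (the sample data and the variable names vals and tally are made up for illustration, and StatsBase is assumed to be installed), countmap returns a dictionary from each unique value to the number of times it occurs:

```julia
using StatsBase  # third-party package, assumed to be installed

# Made-up integer data with a small number of unique values
vals = [3, 1, 3, 2, 3, 1]

# countmap returns a Dict mapping each unique value to its count
tally = countmap(vals)

println(tally[3])  # number of times 3 occurs in vals
```

Note that this gives counts per value only. It does not return the per-element bin index that the question asks for.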

Upvotes: 1

jerlich

Reputation: 362

After looking through the code for Histogram.jl, I found that it already includes a function, binindex, that does this. So this solution is probably the best:

using StatsBase
x = 0:0.001:10
h1 = fit(Histogram, x, 0:10, closed=:left)   # bins are [a, b)
xmap1 = StatsBase.binindex.(Ref(h1), x)
h2 = fit(Histogram, x, 0:10, closed=:right)  # bins are (a, b]
xmap2 = StatsBase.binindex.(Ref(h2), x)

Upvotes: 2

carstenbauer

Reputation: 10127

Inspired by this Python question, you should be able to define a small function that delivers the desired mapping (modulo conventions):

binindices(edges, data) = searchsortedlast.(Ref(edges), data)

Since the bin edges are sorted, we can use searchsortedlast to find the last bin edge that is less than or equal to a data point. Broadcasting this over all of the data gives the mapping. Ref(edges) tells the broadcast to treat edges as a scalar, so the full array is passed to each call.
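To make the conventions concrete, here is a small sketch on made-up data that uses only Base functions. The binindices_right variant and the bincounts variable are hypothetical additions for illustration: the first shows one way to get right-closed bins, the second recovers the per-bin counts from the mapping.

```julia
# Sorted edges defining three bins: [0, 1), [1, 2), [2, 3)
edges = [0.0, 1.0, 2.0, 3.0]
data = [0.5, 1.0, 2.9, 0.0]

# Left-closed bins [a, b): index of the last edge <= x
binindices(edges, data) = searchsortedlast.(Ref(edges), data)

# Hypothetical variant for right-closed bins (a, b]:
# index of the first edge >= x, minus one
binindices_right(edges, data) = searchsortedfirst.(Ref(edges), data) .- 1

left = binindices(edges, data)          # [1, 2, 3, 1]
right = binindices_right(edges, data)   # [1, 1, 3, 0]

# Per-bin counts recovered from the left-closed mapping
bincounts = [count(==(b), left) for b in 1:length(edges)-1]   # [2, 1, 1]
```

With the left-closed version, a value below the first edge maps to 0 and a value at or above the last edge maps to length(edges), so out-of-range data may need separate handling.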

Although conceptually identical to your solution, this approach is about 13x faster on my machine.

I filed an issue on StatsBase.jl's GitHub page suggesting that this be added as a feature.

Upvotes: 4
