Reputation: 159
Sorry this post is not related to coding but more to data structures and Algorithms.
I'm having large amount of data each having different frequencies. The approximate figure plot seems to be a Bell curve. I now want to display the data in ranges which most precisely describes the frequency of the ranges.
e.g. the entire range of data has total no. of frequencies but this range or bucket size is not precise and may be made more precise.(e.g if some data is more concentrated in a particular frequency zone, we may build up a bucket with less data size but having more closely related frequencies.)
Any help regarding some algorithm .
I thought of an algorithm related to binary search.
Any ideas folks.
Upvotes: 0
Views: 4481
Reputation: 134
First count all the indexes then subtract the repeating values this will give you optimal number of buckets. but at small level
Upvotes: -1
Reputation: 178491
Not sure I am following, but it seems you are looking for k
beans, where for each two beans, the probability of the data falling in one bean is identical for it being in the other bean.
From your description, your data seems to be normally distributed, or T-distributed.
One can evaluate the mean and standard deviation of the data, let the extracted S.D. be s
and the mean be u
.
The standard formulas for evaluating the mean and S.D. from the sample are1:
u = (x1 + x2 + ... + xn) / n (simple average)
s^2 = Sigma((xi - u)^2)/(n-1)
Given this information, you can evaluate the distribution of your data, which is N(u,s^2)
. Given this information, you can create a random variabe: X~N(u,s^2)
2
Now all is left is finding the a,b,... as follows (assuming 10 buckets, this can obviously be modified as you wish):
P(X<a) = 0.1
P(X<b) = 0.2
P(X<c) = 0.3
...
After finding a,b,c,... you have your beans: (-infinity,a], (a,b], (a,c], ...
(1) evaluating variance: http://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance
(2)The real distribution for this variable is actually t-distribution, since the variance is unknown - and extracted from the data. However - for large enough n
- t-distribution decays into normal distribution.
Upvotes: 5