Reputation: 1199
I have a long list of ints and I would like to calculate the percentage of numbers which are equal to or above one tenth of the mean. That is, I want to calculate the percentile of the score mean / 10. Here is a naive approach (in Python, but that doesn't matter):
ls = [35,35,73,23,40,60,5,7,3,4,1,1,1,1,1]
length = 0
summ = 0
for i in ls:
    length += 1
    summ += i
mean = float(summ) / float(length)
print('The input value list is: {}'.format(ls))
print('The mean is: {}'.format(mean))
tenth_mean = mean / 10
print('One tenth of the mean is: {}'.format(tenth_mean))
summ = 0
for i in ls:
    if (i >= tenth_mean):
        summ += 1
result = float(summ) / float(length)
print('The percentage of values equal or above one tenth of the mean is: {}'.format(result))
Output:
The input value list is: [35, 35, 73, 23, 40, 60, 5, 7, 3, 4, 1, 1, 1, 1, 1]
The mean is: 19.3333333333
One tenth of the mean is: 1.93333333333
The percentage of values equal or above one tenth of the mean is: 0.666666666667
The problem with this approach is that I have to loop over the list twice. Is there any smart way to avoid this?
I can't see one, since I first need to calculate the mean in order to know which values to count in the second loop.
Furthermore, I would like to do this for multiple percentages (e.g. one tenth of the mean, one fifth of the mean, etc.). This can easily be achieved within the second loop, as sketched below. I just wanted to point this out.
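For instance, a minimal sketch of that extended second loop (reusing ls, length and mean from the code above; the fractions list is just an illustrative choice):

# Count several thresholds in the same second loop (illustrative fractions)
fractions = [0.1, 0.2, 0.5]                    # tenth, fifth, half of the mean
thresholds = [mean * f for f in fractions]
counts = [0] * len(thresholds)
for i in ls:
    for j in range(len(thresholds)):
        if i >= thresholds[j]:
            counts[j] += 1
results = [float(c) / float(length) for c in counts]
print('Fractions at or above the thresholds: {}'.format(results))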
The input array does not follow any distribution.
EDIT: The range of possible values spans only a couple of thousand distinct values. The total number of values is around 3 billion.
EDIT: Fixed usage of the word "percentile" above.
Upvotes: 0
Views: 1491
Reputation: 1199
Based on the answers from others I have come up with the following approach for an improved search: the key insight is that one can count, in a single pass, how often each value occurs and then, after sorting by value, accumulate for every possible value x the number of values smaller than or equal to x. Independently, the mean can be calculated in parallel (i.e. in the same loop). One can then do a linear or binary search in the resulting tuple list to answer a query for any arbitrary fraction of the mean. This works very well when the number of distinct values is much smaller than the total number of values.
Here is a simple implementation in bash/awk:
# The "tee >(awk ... > meant.txt) calculates the mean on the fly
# The second awk ("... value2count ...") counts the occurences of each value
# The sort simply sorts the output of awk (could be done within awk, too)
# The third awk ("... value2maxline ...") counts the number of lines having value x or less ("prevc" = previous count, "prevv" = previous value)
# The sort simply sorts the output of awk (could be done within awk, too)
echo -n "10\n15\n15\n20\n20\n25" | tee >(awk '{ sum += $1; } END { print sum / NR; }' > mean.txt) | awk '{ value2count[$1]++ } END { for (value in value2count) { print value, value2count[value] } }' | sort --numeric-sort --stable -k 1,1 | awk 'BEGIN { prevc = 0 ; prevv = -1 } { if (prevv != $1) { value2maxline[$1] = prevc + $2 ; prevc += $2 ; prevv = $1 } } END { for (value in value2maxline) { print value, value2maxline[value] } }' | sort --numeric-sort --stable -k 1,1 > counts.txt
cat mean.txt
17.5
cat counts.txt
10 1 # one line with value 10
15 3 # 3 lines with value 15 or less
20 5 # 5 lines with value 20 or less
25 6 # 6 lines with value 25 or less, 6 is also the total number of values
In the example above, if I were interested in the percentage of values >= 70% of the mean, I would calculate int(0.7 * 17.5) = 12. Then I'd find (with a linear or binary search in the tuple list) that 1 line (of 6 total lines) has a value of 12 or less ("10 1" is still below, "15 3" is already above). Finally, I'd calculate (6 - 1) / 6 = 0.83: 83% of the values are greater than or equal to 70% of the mean.
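For comparison, here is a minimal Python sketch of the same approach (my own illustration, not a transcription of the pipeline above): one pass accumulates the sum and the per-value counts, the cumulative counts are then built over the few distinct values, and each query is a binary search:

from bisect import bisect_left
from collections import Counter

stream = [10, 15, 15, 20, 20, 25]              # stand-in for the real input
counts = Counter()
total = 0
n = 0
for v in stream:                               # single pass: counts and sum together
    counts[v] += 1
    total += v
    n += 1
mean = total / float(n)                        # 17.5

distinct = sorted(counts)                      # [10, 15, 20, 25]
cumulative = []                                # lines with value <= distinct[i]
running = 0
for v in distinct:
    running += counts[v]
    cumulative.append(running)                 # [1, 3, 5, 6]

def fraction_at_or_above(threshold):
    idx = bisect_left(distinct, threshold)     # first distinct value >= threshold
    below = cumulative[idx - 1] if idx > 0 else 0
    return (n - below) / float(n)

print(fraction_at_or_above(0.7 * mean))        # 0.8333...: 5 of 6 values

With only a couple of thousand distinct values, counts, distinct and cumulative stay tiny even for billions of input values, so both the preprocessing memory and the per-query cost are negligible.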
Upvotes: 0
Reputation: 77857
This is a well-known result of stats and information science: you cannot get all of that information with a single pass. @OmG already gave you the best complexity. Depending on the distribution of your scores, you may be able to improve the search time (but not the complexity) with an interpolation search.
If you have a massive data set, you might also be able to improve the search's starting point with partial estimates of the mean as you progress.
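For illustration, here is a minimal, generic sketch of interpolation search over a sorted list (a textbook version under the assumption of roughly uniformly spaced values; the function name is my own):

def interpolation_search(sorted_vals, target):
    # Probe where a uniform distribution would place the target,
    # instead of always probing the midpoint as binary search does.
    lo, hi = 0, len(sorted_vals) - 1
    while lo <= hi and sorted_vals[lo] <= target <= sorted_vals[hi]:
        if sorted_vals[lo] == sorted_vals[hi]:
            return lo if sorted_vals[lo] == target else -1
        span = sorted_vals[hi] - sorted_vals[lo]
        pos = lo + int((hi - lo) * float(target - sorted_vals[lo]) / span)
        if sorted_vals[pos] == target:
            return pos
        if sorted_vals[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1                                  # target not present

print(interpolation_search([10, 15, 20, 25], 20))   # 2

On close-to-uniform data this takes O(log log n) probes on average, but it can degrade to O(n) probes on skewed data, which is why it can improve the search time but not the worst-case complexity.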
Upvotes: 1
Reputation: 18838
If you run many queries on the list, it might be helpful to do some preprocessing to decrease the per-query time complexity to O(log(n)).
If you sort the list and compute the mean (using Python's built-in functions), you can answer such percentile queries with binary search. Hence, each query takes O(log(n)).
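For example, a minimal sketch with Python's bisect module (using the list from the question; fraction_at_or_above is my own helper name):

import bisect

ls = [35, 35, 73, 23, 40, 60, 5, 7, 3, 4, 1, 1, 1, 1, 1]
sorted_ls = sorted(ls)                         # O(n log(n)) preprocessing, done once
mean = sum(sorted_ls) / float(len(sorted_ls))

def fraction_at_or_above(threshold):
    # bisect_left returns the index of the first element >= threshold,
    # so everything from that index onwards counts: O(log(n)) per query.
    idx = bisect.bisect_left(sorted_ls, threshold)
    return (len(sorted_ls) - idx) / float(len(sorted_ls))

print(fraction_at_or_above(mean / 10))         # 0.666..., as in the question
print(fraction_at_or_above(mean / 5))          # one fifth of the mean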
Upvotes: 1