Reputation: 1221
I would like to compute the weighted median of a list of unique values and a list of weights. Weights tell how often each value is present in the list.
Example:
real_data = [1,1,2,3,3,4,4,4]
values = [1,2,3,4]
weights = [2,1,2,3]
One way to do it should be:
np.median(np.repeat(values, weights))
However I feel like this is a bit inefficient since it first generated the whole list, which could be a problem if the weights are very high. Is there a more efficient way of doing this?
Also, out of curiosity, can you think of a way to write np.repeat as a list comprehension?
Upvotes: 0
Views: 138
Reputation: 2525
The solution I came up with:
def median_3(weights, values):
s=0
n=sum(weights)
for i,w in enumerate(weights):
s+=w
if s>n/2:
if n%2 == 0:
if s-w==n/2:
return (values[i]+values[i-1])/2
else:
return values[i]
else:
return values[i]
The code for time comparison:
import timeit
def median_1(weights, values):
return np.median(np.repeat(values, weights))
def median_3(weights, values):
s=0
n=sum(weights)
for i,w in enumerate(weights):
s+=w
if s>n/2:
if n%2 == 0:
if s-w==n/2:
return (values[i]+values[i-1])/2
else:
return values[i]
else:
return values[i]
t1 = timeit.Timer(lambda: median_1(weights, values))
t3 = timeit.Timer(lambda: median_3(weights, values))
print(f"function median_1 for 1000 cycles: {t1.timeit(1000)} s")
print(f"function median 3 for 1000 cycles: {t3.timeit(1000)} s")
print(f" result from median_1 {median_1(weights, values)}")
print(f" result from median_3 {median_3(weights, values)}")
result:
function median_1 for 1000 cycles: 0.051409600000000055 s
function median 3 for 1000 cycles: 0.0013161999999999896 s
result from median_1 3.0
result from median_3 3
Hope this helps. It should also work for even number of elements.
Upvotes: 1