Reputation: 1378
How can I get a proper median calculation on data that has already been aggregated?
For example, if I have a data frame that looks like this:
> df <- data.frame(name = c("A","B","C","D"), count = c(1,3,5,2), avg = c(100,50,20,10))
> df
# A tibble: 4 × 3
name count avg
<chr> <dbl> <dbl>
1 A 1 100
2 B 3 50
3 C 5 20
4 D 2 10
Assume we don't know much about what's inside the bins, but that there is little variation within each bin. To the best of our knowledge, the values would then line up like this:
10 10 20 20 20 20 20 50 50 50 100
Out of 11 values, the median is the 6th one, which is 20.
But if I simply take median(), R computes it over the 4 bin averages 10, 20, 50, 100:
> median(df$avg)
[1] 35
That is not what I want.
How can I get around this and "unfold" the data set?
Upvotes: 3
Views: 1915
Reputation: 1378
This was solved by a comment from Zheyuan Li: repeat each average count times with rep.int() and take the median of the result. It is simple, and I'm surprised I didn't know about it.
with(df, median(rep.int(avg, count)))
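For anyone landing here later, below is a minimal sketch of what the one-liner does on the data from the question, plus an alternative that avoids building the expanded vector. The helper name grouped_median is hypothetical (it is not part of base R or of the original comment). It assumes the counts are non-negative whole numbers and that every value within a bin equals the bin average.

```r
# Data from the question
df <- data.frame(name  = c("A", "B", "C", "D"),
                 count = c(1, 3, 5, 2),
                 avg   = c(100, 50, 20, 10))

# Approach from the answer: repeat each bin average `count` times
expanded <- with(df, rep.int(avg, count))
sort(expanded)    # 10 10 20 20 20 20 20 50 50 50 100
median(expanded)  # 20

# Hypothetical helper (an assumption, not from the original answer):
# same result without materialising the expanded vector, which may
# matter when the counts are very large.
grouped_median <- function(values, counts) {
  ord    <- order(values)
  values <- values[ord]
  cum    <- cumsum(counts[ord])   # running total of observations
  n      <- cum[length(cum)]      # total number of observations
  # Middle position(s) in the sorted, expanded vector; the two
  # coincide when n is odd and differ by one when n is even.
  lo <- values[which(cum >= floor((n + 1) / 2))[1]]
  hi <- values[which(cum >= ceiling((n + 1) / 2))[1]]
  (lo + hi) / 2
}

grouped_median(df$avg, df$count)  # 20
```

Both approaches treat each bin as count identical copies of its average, so the result is only as good as the "little variation within bins" assumption from the question.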
Upvotes: 10