Reputation: 175
I have a dataframe regarding residents divided by age in certain towns. In other words I have something like this
Town Aged18 Aged19 Aged20 Aged21 Aged22 Aged23 Aged24 Aged25 Total
Rome 2 3 5 10 15 25 20 18 98
Milan 15 19 21 25 25 20 35 15 175
Turin 14 8 8 10 15 25 24 6 110
Florence 10 15 15 10 25 10 5 5 95
Bologna 15 10 25 10 15 10 25 20 130
I want to create a new column indicating in which "age" column does the median of the distribution of residents fall for each town. In other words I want to create something like this
Town Aged18 Aged19 Aged20 Aged21 Aged22 Aged23 Aged24 Aged25 Total Median
Rome 2 3 5 10 15 25 20 18 98 23
Milan 15 19 21 25 25 20 35 15 175 22
Turin 14 8 8 10 15 25 24 6 110 22,5
Florence 10 15 15 10 25 10 5 5 95 21
Bologna 15 10 25 10 15 10 25 20 130 22
Strictly speaking I want to create a loop which sums the content of each column until we reach the median position, which, if column's total is odd, corresponds to (n+1)/2, while, if it is even, it corresponds to (n/2+(n+1)/2)/2. This latter is the case of Turin in my dataframe, where I am averaging between 22 and 23 since they correspond to the columns where the 55th (110/2) and the 56th (111/2) observations are contained respectively.
Therefore I want the new column not to compute the median accross the values of the columns (which is something we can do through rowMedian), but I want it to return the column(s) which contains the median observation.
Can anyone help me with this?? Thank you so much, I hope this can be useful even for someone else trying to do something similar.
Upvotes: 2
Views: 277
Reputation: 66500
Here's an approach using the tidyverse.
First I gather the data into long form, which often simplifies group analysis. I also convert the Age columns into numbers.
Then, for each Town
, I count the cumulative people through that age. We mark a row as corresponding to the median if the prior cumulative count was less than halfway, but the current cumulative count is at least halfway. Then we add an adjustment for the special case where Total is even and the prior row cumulative was equal to the halfway total.
library(tidyverse)
df_long <-
df %>%
gather(age, value, Aged18:Aged25) %>%
mutate(age = str_remove(age, "Aged") %>% as.numeric()) %>%
arrange(Town, age) %>% # Probably not necessary but doesn't hurt
group_by(Town) %>%
mutate(cuml_count = cumsum(value),
median = lag(cuml_count < Total / 2, default = FALSE) & cuml_count >= Total / 2,
median = if_else(Total %% 2 == 0 & lag(cuml_count, default = FALSE) == Total / 2,
TRUE, median))
Here's a visual check:
ggplot(df_long, aes(age, cuml_count/Total, color = median)) + geom_point() + facet_wrap(~Town)
Finally, we can combine the original table with those medians:
df2 <- df %>%
left_join(df_long,
filter(median) %>%
group_by(Town) %>%
summarize(median = mean(age)))
Output:
> df2
Town Aged18 Aged19 Aged20 Aged21 Aged22 Aged23 Aged24 Aged25 Total median
1 Rome 2 3 5 10 15 25 20 18 98 23.0
2 Milan 15 19 21 25 25 20 35 15 175 22.0
3 Turin 14 8 8 10 15 25 24 6 110 22.5
4 Florence 10 15 15 10 25 10 5 5 95 21.0
5 Bologna 15 10 25 10 15 10 25 20 130 22.0
Upvotes: 2