Marco Mello
Marco Mello

Reputation: 175

Median across columns using a loop

I have a dataframe regarding residents divided by age in certain towns. In other words I have something like this

Town        Aged18 Aged19 Aged20 Aged21 Aged22 Aged23 Aged24 Aged25 Total
Rome        2      3      5      10     15     25     20     18     98
Milan       15     19     21     25     25     20     35     15     175
Turin       14     8      8      10     15     25     24     6      110
Florence    10     15     15     10     25     10     5      5      95
Bologna     15     10     25     10     15     10     25     20     130

I want to create a new column indicating in which "age" column does the median of the distribution of residents fall for each town. In other words I want to create something like this

Town    Aged18 Aged19 Aged20 Aged21 Aged22 Aged23 Aged24 Aged25 Total Median
Rome     2      3      5      10     15     25     20     18     98   23
Milan    15     19     21     25     25     20     35     15     175  22
Turin    14     8      8      10     15     25     24     6      110  22,5
Florence 10     15     15     10     25     10     5      5      95   21
Bologna  15     10     25     10     15     10     25     20     130  22

Strictly speaking I want to create a loop which sums the content of each column until we reach the median position, which, if column's total is odd, corresponds to (n+1)/2, while, if it is even, it corresponds to (n/2+(n+1)/2)/2. This latter is the case of Turin in my dataframe, where I am averaging between 22 and 23 since they correspond to the columns where the 55th (110/2) and the 56th (111/2) observations are contained respectively.

Therefore I want the new column not to compute the median accross the values of the columns (which is something we can do through rowMedian), but I want it to return the column(s) which contains the median observation.

Can anyone help me with this?? Thank you so much, I hope this can be useful even for someone else trying to do something similar.

Upvotes: 2

Views: 277

Answers (1)

Jon Spring
Jon Spring

Reputation: 66500

Here's an approach using the tidyverse.

First I gather the data into long form, which often simplifies group analysis. I also convert the Age columns into numbers.

Then, for each Town, I count the cumulative people through that age. We mark a row as corresponding to the median if the prior cumulative count was less than halfway, but the current cumulative count is at least halfway. Then we add an adjustment for the special case where Total is even and the prior row cumulative was equal to the halfway total.

library(tidyverse)
df_long <-
  df %>%
    gather(age, value, Aged18:Aged25) %>%
    mutate(age = str_remove(age, "Aged") %>% as.numeric()) %>%
    arrange(Town, age) %>%  # Probably not necessary but doesn't hurt
    group_by(Town) %>%
    mutate(cuml_count = cumsum(value),
           median     = lag(cuml_count < Total / 2, default = FALSE) & cuml_count >= Total / 2,
           median     = if_else(Total %% 2 == 0 & lag(cuml_count, default = FALSE) == Total / 2, 
                                TRUE, median))

Here's a visual check:

ggplot(df_long, aes(age, cuml_count/Total, color = median)) + geom_point() + facet_wrap(~Town)

enter image description here

Finally, we can combine the original table with those medians:

df2 <- df %>% 
  left_join(df_long,
            filter(median) %>%
            group_by(Town) %>%
            summarize(median = mean(age)))

Output:

> df2
      Town Aged18 Aged19 Aged20 Aged21 Aged22 Aged23 Aged24 Aged25 Total median
1     Rome      2      3      5     10     15     25     20     18    98   23.0
2    Milan     15     19     21     25     25     20     35     15   175   22.0
3    Turin     14      8      8     10     15     25     24      6   110   22.5
4 Florence     10     15     15     10     25     10      5      5    95   21.0
5  Bologna     15     10     25     10     15     10     25     20   130   22.0

Upvotes: 2

Related Questions