Economist_Ayahuasca
Economist_Ayahuasca

Reputation: 1642

R: Obtaining the average of numeric variables while the max for binary variables in a dataframe

I have the following dataset:

Lines <- "id time sex Age A
1  1       male   90 0
1  2       male   91 0
1  3       male   92 1
2  1       female  87 0
2  2       female  88 0
2  3       female  89 0
3  1       male  50 0
3  2       male  51 1
3  3       male  52 0
4  1       female  54 0
4  2       female  55 0
4  3       female  56 0"

I would like to obtain the mean value for the numeric variable and the highest value for the binary variables (e.g. if 0 and 1, then 1). And for the string variables, just the name as this does not vary across id.

The resulting data frame should be something like this:

Lines <- "id time sex Age A
1  2       male   91 1
2  2       female  88 0
3  2       male  51 1
4  2       female  55 0"

I have more or less an idea:

Lines <- Lines %>%
    group_by(id, across(where(is.character))) %>%
    summarise(across(where(all(column %in% 0:1)), max)) %>%
    summarise(across(where(is.numeric), mean)) %>%
    ungroup

Upvotes: 1

Views: 66

Answers (1)

Anoushiravan R
Anoushiravan R

Reputation: 21928

I thought about using across function to apply mean on only numeric variables in one try, but I noticed they are actually stored as integer as well as your binary variable. So we either have to change their classes before our solution and use across or explicitly name every variable like this:

library(dplyr)

df %>%
  group_by(id) %>%
  summarise(id = first(id), sex = first(sex), time = mean(time), 
            Age = mean(Age), A = max(A))

# A tibble: 4 x 5
     id sex     time   Age     A
  <int> <chr>  <dbl> <dbl> <int>
1     1 male       2    91     1
2     2 female     2    88     0
3     3 male       2    51     1
4     4 female     2    55     0

Here is another solution, however I coerced some variables into classes mentioned in the question before applying our transformation. It would be a good idea if you could do it too:

df %>%
  group_by(id) %>%
  mutate(across(c(time, Age), as.numeric)) %>%
  summarise(id = first(id), sex = first(sex), 
            across(where(is.integer), max, .names = "{.col}"),
            across(where(is.double), mean, .names = "{.col}"))

# A tibble: 4 x 5
     id sex        A  time   Age
  <int> <chr>  <int> <dbl> <dbl>
1     1 male       1     2    91
2     2 female     0     2    88
3     3 male       1     2    51
4     4 female     0     2    55

Upvotes: 2

Related Questions