Reputation: 675
I am regularly faced with data frames which have a number of columns which need to be coalesced into a single column. The order of coalescing is based on the number of valid entries, with the variable with the most valid entries getting precedence. I have written code to provide a count of valid entries for each variable, and based on this output I manually define the order that the variables are to be coalesced. I was wondering if there is a way to automate this step so that I do not need to manually type the order to coalesce? Below is an example dataset.
df <- data.frame(1:16)
df$A <- c(10,0,0,0,0,0,12,12,0,14,NA_real_,14,NA_real_,NA_real_,16,16)
df$B <- c(10,0,0,0,12,12,12,12,0,14,NA_real_,14,16,16,16,16)
df$C <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
A B C
10 10 10
0 0 12
0 0 14
0 0 16
0 12 10
0 12 12
12 12 14
12 12 16
0 0 10
14 14 12
NA NA 14
14 14 16
NA 16 10
NA 16 12
16 16 14
16 16 16
I have been determining which variable has the most valid entries using the code below (NA values and 0 are treated the same):
sum(df$A > 0, na.rm = TRUE)
#7
sum(df$B > 0, na.rm = TRUE)
#11
sum(df$C > 0, na.rm = TRUE)
#16
To coalesce the data I have used the code:
df %>% mutate(new_value = coalesce(C,B,A))
A B C new_value
10 10 10 10
0 0 12 12
0 0 14 14
0 0 16 16
0 12 10 10
0 12 12 12
12 12 14 14
12 12 16 16
0 0 10 10
14 14 12 12
NA NA 14 14
14 14 16 16
NA 16 10 10
NA 16 12 12
16 16 14 14
16 16 16 16
Upvotes: 3
Views: 55
Reputation: 40171
One dplyr
possibility could be:
df %>%
mutate(new_value = pull(select(., which.max(across(everything(), ~ sum(. > 0, na.rm = TRUE))))))
A B C new_value
1 10 10 10 10
2 0 0 12 12
3 0 0 14 14
4 0 0 16 16
5 0 12 10 10
6 0 12 12 12
7 12 12 14 14
8 12 12 16 16
9 0 0 10 10
10 14 14 12 12
11 NA NA 14 14
12 14 14 16 16
13 NA 16 10 10
14 NA 16 12 12
15 16 16 14 14
16 16 16 16 16
Upvotes: 1
Reputation: 389235
You can use colSums
to get count of number of values greater than 0 in df
, sort
them in decreasing
order and store the column names (cols
). Subset the data based on cols
and use coalesce
with do.call
.
cols <- names(sort(colSums(df[-1] > 0, na.rm =TRUE), decreasing = TRUE))
df$new_value <- do.call(dplyr::coalesce, df[cols])
We use -1
to exclude the first column which has values 1:16
. This will work for any number of columns.
Upvotes: 3