Reputation: 4645
I have a data that I want to find out which column has the maximum value and output that column name. One issue is that if there is no maximum value (e.g, all numeric numbers equal) return all_equal
comment or if two column has equal max value compared to the third one output those two column name.
Here is the sample data
test <- data.frame(A=c(5,NA,NA,1,NA,NA,3,NA,NA),B=c(NA,2,NA,NA,1,NA,NA,1,NA),C=c(NA,NA,1,NA,NA,1,NA,NA,3),gr=gl(3,3))
A B C gr
1 5 NA NA 1
2 NA 2 NA 1
3 NA NA 1 1
4 1 NA NA 2
5 NA 1 NA 2
6 NA NA 1 2
7 3 NA NA 3
8 NA 1 NA 3
9 NA NA 3 3
In each gr
there is values in column A
, B and C
. My purpose is to find which column has the maximum value in that group and output that column name to new column called col_name
.
if all values are equal to each other as in gr=2
output is all_equal
if two of the column has max value comparing to the third column like in gr=3
output column names A&C
to the col_name
.
I realized that it might be difficult to build a pipeline without gather
so I tried;
library(dplyr)
test%>%
group_by(gr)%>%
gather(variable, value, -gr) %>%
arrange(gr)%>%
mutate(col_name=variable[which.max(value)])
# A tibble: 18 x 4
# Groups: gr [2]
r variable value col_name
<fct> <chr> <dbl> <chr>
1 1 A 5 A
2 1 A NA A
3 1 A NA A
4 1 B NA A
5 1 B 2 A
6 1 B NA A
7 1 C NA A
8 1 C NA A
9 1 C 1 A
10 2 A 1 A
11 2 A NA A
12 2 A NA A
13 2 B NA A
14 2 B 1 A
15 2 B NA A
16 2 C NA A
17 2 C NA A
18 2 C 1 A
The problem I am struggling in here is how to output all_equal
comment if all the max values are equal in columns A,B and C and
if 2 columns max value equals (A and C in gr=3) outputting those equal column names in this format A&C
in col_name
The expected output would be
> test
A B C gr col_name
1 5 NA NA 1 A
2 NA 2 NA 1 A
3 NA NA 1 1 A
4 1 NA NA 2 all_equal
5 NA 1 NA 2 all_equal
6 NA NA 1 2 all_equal
7 3 NA NA 3 A&C
8 NA 1 NA 3 A&C
9 NA NA 3 3 A&C
thx in advance!
Upvotes: 5
Views: 596
Reputation: 47350
Here are 2 base R
solutions
Using split<-
and aggregate
test0 <- aggregate(test[1:3], by = test[4], max,na.rm=T)[-1]
nms <- apply(do.call(pmax,test0) == test0, 1 , function(x) names(which(x)))
test$col_name <- NA
split(test$col_name, test$gr) <-
ifelse(lengths(nms) == 3, "all_equal", lapply(nms,paste,collapse="&"))
test
# A B C gr col_name
# 1 5 NA NA 1 A
# 2 NA 2 NA 1 A
# 3 NA NA 1 1 A
# 4 1 NA NA 2 all_equal
# 5 NA 1 NA 2 all_equal
# 6 NA NA 1 2 all_equal
# 7 3 NA NA 3 A&C
# 8 NA 1 NA 3 A&C
# 9 NA NA 3 3 A&C
details
We first coalesce test into test0
test0
# A B C
# 1 5 2 1
# 2 1 1 1
# 3 3 1 3
Then get the max rowwise with pmax
, the column indices with which
and the column names with names
.
We assign pasted names (with "all_equal"
exception) with split<-
, they're recycled to form the output.
Using stack
and ave
stacked <- cbind(na.omit(stack(test,-gr)), gr=levels(test$gr))
test$col_name <- with(stacked, ave(values, gr, FUN = function(x){
nms <- paste(names(test)[which(x == max(x))],collapse="&")
if (length(nms) == 3) "all_equal" else nms})[order(gr)])
# A B C gr col_name
# 1 5 NA NA 1 A
# 2 NA 2 NA 1 A
# 3 NA NA 1 1 A
# 4 1 NA NA 2 A&B&C
# 5 NA 1 NA 2 A&B&C
# 6 NA NA 1 2 A&B&C
# 7 3 NA NA 3 A&C
# 8 NA 1 NA 3 A&C
# 9 NA NA 3 3 A&C
Upvotes: 0
Reputation: 8120
Similar to akrun's answer
library(tidyverse)
test <- data_frame(A=c(5,NA,NA,1,NA,NA,3,NA,NA),B=c(NA,2,NA,NA,1,NA,NA,1,NA),C=c(NA,NA,1,NA,NA,1,NA,NA,3),gr=gl(3,3))
test %>% gather(key, value, -gr, na.rm = TRUE) %>%
group_by(gr) %>%
arrange(gr) %>%
mutate(col_name = if_else(length(which(value == max(value))) == length(unique(key)),
"all_equal",
paste(flatten(.[which(value == max(value)), "key"]), collapse = "&"))) %>%
spread(key, value)
#> # A tibble: 3 x 5
#> # Groups: gr [3]
#> gr col_name A B C
#> <fct> <chr> <dbl> <dbl> <dbl>
#> 1 1 A 5 2 1
#> 2 2 all_equal 1 1 1
#> 3 3 A&C 3 1 3
Created on 2018-08-09 by the reprex package (v0.2.0).
This will also condense the dataframe to remove all the extra NA
s
Upvotes: 2
Reputation: 887851
Here is an option using gather
. We gather
the data into 'long' format, arrange
by the group ('gr') and 'val' in descending order, grouped by 'gr', summarise
to create the 'col_name' for each 'gr' based on the condition described in the OP's post and right_join
with the original dataset
library(tidyverse)
test %>%
gather(key, val, -gr, na.rm = TRUE) %>%
arrange(gr, desc(val)) %>%
group_by(gr) %>%
summarise(col_name = case_when(n_distinct(val)==1 ~ "all_equal",
TRUE ~ paste(key[val==max(val)], collapse = "&"))) %>%
right_join(test) %>%
select(names(test), everything())
# A tibble: 9 x 5
# A B C gr col_name
# <dbl> <dbl> <dbl> <fct> <chr>
#1 5 NA NA 1 A
#2 NA 2 NA 1 A
#3 NA NA 1 1 A
#4 1 NA NA 2 all_equal
#5 NA 1 NA 2 all_equal
#6 NA NA 1 2 all_equal
#7 3 NA NA 3 A&C
#8 NA 1 NA 3 A&C
#9 NA NA 3 3 A&C
or using data.table
library(data.table)
library(stringr)
setDT(test)[, col_name := {
v1 <- sort(na.omit(unlist(.SD)), decreasing = TRUE)
if(uniqueN(v1)==1) "all_equal" else
paste(str_remove(names(v1)[v1==max(v1)], "\\d+"), collapse="&")
}, by = gr]
test
# A B C gr col_name
#1: 5 NA NA 1 A
#2: NA 2 NA 1 A
#3: NA NA 1 1 A
#4: 1 NA NA 2 all_equal
#5: NA 1 NA 2 all_equal
#6: NA NA 1 2 all_equal
#7: 3 NA NA 3 A&C
#8: NA 1 NA 3 A&C
#9: NA NA 3 3 A&C
Upvotes: 2
Reputation: 10336
Here is a dplyr
approach that I tried to make a little more generalized to accommodate a different number of columns of interest. With your test
data frame from above, start by defining a function that finds the max of the current group, gets indices for columns with matching values, then builds the output based on the number of matching columns:
foo <- function(df_, cols = 1:3) {
# Get max
m = max(df_[, cols], na.rm = TRUE)
# Get columns
ix <- as.data.frame(which(df_[, cols] == m, arr.ind = TRUE))[, 2]
matchlen = length(ix)
columns <- names(df_[,cols])[ix]
# Get varname based on length
out = ifelse(matchlen == length(cols), "all_equal", paste(columns, collapse = "&"))
df_$col_name = out
return(df_)
}
Because the output from that is a data frame, you need to make use of do
to apply it to groups with dplyr
:
test %>%
group_by(gr) %>%
do(foo(.))
# A tibble: 9 x 5
# Groups: gr [3]
A B C gr col_name
<dbl> <dbl> <dbl> <fct> <chr>
1 5 NA NA 1 A
2 NA 2 NA 1 A
3 NA NA 1 1 A
4 1 NA NA 2 all_equal
5 NA 1 NA 2 all_equal
6 NA NA 1 2 all_equal
7 3 NA NA 3 A&C
8 NA 1 NA 3 A&C
9 NA NA 3 3 A&C
The function should allow for a flexible number of columns to be input, as long as they're numeric. For example,
test %>%
group_by(gr) %>%
do(foo(., cols = 1:2))
and
test %>%
group_by(gr) %>%
do(foo(., cols = c(1,3)))
both seem to work.
Edit:
Yeah, I guess you can!
test %>%
group_by(gr) %>%
do(foo(., cols = c("A", "B", "C")))
Upvotes: 2
Reputation: 6073
A solution using a custom function and data.table:
myfun <- function(x) {
x <- as.matrix(x)
idx <- apply(x, 2, max, na.rm=T)==max(x, na.rm=T)
who <- colnames(x)[idx]
if(length(who)==1) return(who)
if(length(who)==2) return(paste0(who, collapse = "&"))
if(length(who)>2) return("all_equal")
}
library(data.table)
dt <- data.table(test)
dt[ , new := myfun(cbind(A,B,C)), by=gr]
Notice that I had to use ==max()
instead of which.max
to deal with ties. I also struggled to implement this with a base function like aggregate
so I went the data.table
way.
Upvotes: 1