Alexander
Alexander

Reputation: 4645

How to output the columns with the maximum value

I have a data that I want to find out which column has the maximum value and output that column name. One issue is that if there is no maximum value (e.g, all numeric numbers equal) return all_equal comment or if two column has equal max value compared to the third one output those two column name.

Here is the sample data

test <- data.frame(A=c(5,NA,NA,1,NA,NA,3,NA,NA),B=c(NA,2,NA,NA,1,NA,NA,1,NA),C=c(NA,NA,1,NA,NA,1,NA,NA,3),gr=gl(3,3))

   A  B  C gr
1  5 NA NA  1
2 NA  2 NA  1
3 NA NA  1  1
4  1 NA NA  2
5 NA  1 NA  2
6 NA NA  1  2
7  3 NA NA  3
8 NA  1 NA  3
9 NA NA  3  3

In each gr there is values in column A, B and C. My purpose is to find which column has the maximum value in that group and output that column name to new column called col_name.

if all values are equal to each other as in gr=2 output is all_equal

if two of the column has max value comparing to the third column like in gr=3 output column names A&C to the col_name.

I realized that it might be difficult to build a pipeline without gather

so I tried;

library(dplyr)
test%>%
  group_by(gr)%>%

  gather(variable, value, -gr) %>%
  arrange(gr)%>%
  mutate(col_name=variable[which.max(value)])

# A tibble: 18 x 4
# Groups:   gr [2]
   r    variable value col_name
   <fct> <chr>    <dbl> <chr>   
 1 1     A            5 A       
 2 1     A           NA A       
 3 1     A           NA A       
 4 1     B           NA A       
 5 1     B            2 A       
 6 1     B           NA A       
 7 1     C           NA A       
 8 1     C           NA A       
 9 1     C            1 A       
10 2     A            1 A       
11 2     A           NA A       
12 2     A           NA A       
13 2     B           NA A       
14 2     B            1 A       
15 2     B           NA A       
16 2     C           NA A       
17 2     C           NA A       
18 2     C            1 A 

The problem I am struggling in here is how to output all_equal comment if all the max values are equal in columns A,B and C and

if 2 columns max value equals (A and C in gr=3) outputting those equal column names in this format A&C in col_name

The expected output would be

> test
       A  B  C gr  col_name
    1  5 NA NA  1     A
    2 NA  2 NA  1     A
    3 NA NA  1  1     A
    4  1 NA NA  2  all_equal
    5 NA  1 NA  2  all_equal 
    6 NA NA  1  2  all_equal
    7  3 NA NA  3  A&C
    8 NA  1 NA  3  A&C 
    9 NA NA  3  3  A&C

thx in advance!

Upvotes: 5

Views: 596

Answers (5)

moodymudskipper
moodymudskipper

Reputation: 47350

Here are 2 base R solutions

Using split<- and aggregate

test0         <- aggregate(test[1:3], by = test[4], max,na.rm=T)[-1]
nms           <- apply(do.call(pmax,test0) == test0, 1 , function(x) names(which(x)))
test$col_name <- NA
split(test$col_name, test$gr) <- 
  ifelse(lengths(nms) == 3, "all_equal", lapply(nms,paste,collapse="&"))
test
# A  B  C gr  col_name
# 1  5 NA NA  1         A
# 2 NA  2 NA  1         A
# 3 NA NA  1  1         A
# 4  1 NA NA  2 all_equal
# 5 NA  1 NA  2 all_equal
# 6 NA NA  1  2 all_equal
# 7  3 NA NA  3       A&C
# 8 NA  1 NA  3       A&C
# 9 NA NA  3  3       A&C

details

We first coalesce test into test0

test0
#   A B C
# 1 5 2 1
# 2 1 1 1
# 3 3 1 3

Then get the max rowwise with pmax, the column indices with which and the column names with names.

We assign pasted names (with "all_equal" exception) with split<-, they're recycled to form the output.

Using stack and ave

stacked <- cbind(na.omit(stack(test,-gr)), gr=levels(test$gr))
test$col_name <-  with(stacked, ave(values, gr, FUN = function(x){
  nms <- paste(names(test)[which(x == max(x))],collapse="&")
  if (length(nms) == 3) "all_equal" else nms})[order(gr)])

#    A  B  C gr col_name
# 1  5 NA NA  1        A
# 2 NA  2 NA  1        A
# 3 NA NA  1  1        A
# 4  1 NA NA  2    A&B&C
# 5 NA  1 NA  2    A&B&C
# 6 NA NA  1  2    A&B&C
# 7  3 NA NA  3      A&C
# 8 NA  1 NA  3      A&C
# 9 NA NA  3  3      A&C

Upvotes: 0

AndS.
AndS.

Reputation: 8120

Similar to akrun's answer

library(tidyverse)

test <- data_frame(A=c(5,NA,NA,1,NA,NA,3,NA,NA),B=c(NA,2,NA,NA,1,NA,NA,1,NA),C=c(NA,NA,1,NA,NA,1,NA,NA,3),gr=gl(3,3))


test %>% gather(key, value, -gr, na.rm = TRUE) %>% 
  group_by(gr) %>% 
  arrange(gr) %>%
  mutate(col_name = if_else(length(which(value == max(value))) == length(unique(key)),
                     "all_equal",
                     paste(flatten(.[which(value == max(value)), "key"]), collapse = "&"))) %>% 
  spread(key, value)
#> # A tibble: 3 x 5
#> # Groups:   gr [3]
#>   gr    col_name      A     B     C
#>   <fct> <chr>     <dbl> <dbl> <dbl>
#> 1 1     A             5     2     1
#> 2 2     all_equal     1     1     1
#> 3 3     A&C           3     1     3

Created on 2018-08-09 by the reprex package (v0.2.0).

This will also condense the dataframe to remove all the extra NAs

Upvotes: 2

akrun
akrun

Reputation: 887851

Here is an option using gather. We gather the data into 'long' format, arrange by the group ('gr') and 'val' in descending order, grouped by 'gr', summarise to create the 'col_name' for each 'gr' based on the condition described in the OP's post and right_join with the original dataset

library(tidyverse)
test %>% 
   gather(key, val, -gr, na.rm = TRUE) %>%
   arrange(gr, desc(val)) %>%       
   group_by(gr) %>%         
   summarise(col_name = case_when(n_distinct(val)==1 ~ "all_equal",
                        TRUE ~ paste(key[val==max(val)], collapse = "&"))) %>% 
   right_join(test) %>%
   select(names(test), everything())
# A tibble: 9 x 5
#      A     B     C gr    col_name 
#  <dbl> <dbl> <dbl> <fct> <chr>    
#1     5    NA    NA 1     A        
#2    NA     2    NA 1     A        
#3    NA    NA     1 1     A        
#4     1    NA    NA 2     all_equal
#5    NA     1    NA 2     all_equal
#6    NA    NA     1 2     all_equal
#7     3    NA    NA 3     A&C      
#8    NA     1    NA 3     A&C      
#9    NA    NA     3 3     A&C      

or using data.table

library(data.table)
library(stringr)
setDT(test)[, col_name := {
      v1 <- sort(na.omit(unlist(.SD)), decreasing = TRUE)
      if(uniqueN(v1)==1) "all_equal" else 
     paste(str_remove(names(v1)[v1==max(v1)], "\\d+"), collapse="&")
    }, by = gr]

test
#    A  B  C gr  col_name
#1:  5 NA NA  1         A
#2: NA  2 NA  1         A
#3: NA NA  1  1         A
#4:  1 NA NA  2 all_equal
#5: NA  1 NA  2 all_equal
#6: NA NA  1  2 all_equal
#7:  3 NA NA  3       A&C
#8: NA  1 NA  3       A&C
#9: NA NA  3  3       A&C

Upvotes: 2

Luke C
Luke C

Reputation: 10336

Here is a dplyr approach that I tried to make a little more generalized to accommodate a different number of columns of interest. With your test data frame from above, start by defining a function that finds the max of the current group, gets indices for columns with matching values, then builds the output based on the number of matching columns:

foo <- function(df_, cols = 1:3) {
  # Get max
  m = max(df_[, cols], na.rm = TRUE)

  # Get columns
  ix <- as.data.frame(which(df_[, cols] == m, arr.ind = TRUE))[, 2]
  matchlen = length(ix)
  columns <- names(df_[,cols])[ix]

  # Get varname based on length
  out = ifelse(matchlen == length(cols), "all_equal", paste(columns, collapse = "&"))
  df_$col_name = out
  return(df_)
}

Because the output from that is a data frame, you need to make use of do to apply it to groups with dplyr:

test %>%
  group_by(gr) %>%
  do(foo(.))

# A tibble: 9 x 5
# Groups:   gr [3]
      A     B     C gr    col_name 
  <dbl> <dbl> <dbl> <fct> <chr>    
1     5    NA    NA 1     A        
2    NA     2    NA 1     A        
3    NA    NA     1 1     A        
4     1    NA    NA 2     all_equal
5    NA     1    NA 2     all_equal
6    NA    NA     1 2     all_equal
7     3    NA    NA 3     A&C      
8    NA     1    NA 3     A&C      
9    NA    NA     3 3     A&C 

The function should allow for a flexible number of columns to be input, as long as they're numeric. For example,

test %>%
  group_by(gr) %>%
  do(foo(., cols = 1:2))

and

test %>%
  group_by(gr) %>%
  do(foo(., cols = c(1,3)))

both seem to work.

Edit:

Yeah, I guess you can!

test %>%
  group_by(gr) %>%
  do(foo(., cols = c("A", "B", "C")))

Upvotes: 2

DanY
DanY

Reputation: 6073

A solution using a custom function and data.table:

myfun <- function(x) {
    x <- as.matrix(x)
    idx <- apply(x, 2, max, na.rm=T)==max(x, na.rm=T)
    who <- colnames(x)[idx]
    if(length(who)==1) return(who)
    if(length(who)==2) return(paste0(who, collapse = "&"))
    if(length(who)>2)  return("all_equal")
}

library(data.table)
dt <- data.table(test)

dt[ , new := myfun(cbind(A,B,C)), by=gr]

Notice that I had to use ==max() instead of which.max to deal with ties. I also struggled to implement this with a base function like aggregate so I went the data.table way.

Upvotes: 1

Related Questions