watchtower
watchtower

Reputation: 4298

Finding index corresponding to the maximum value

I found this thread Find rows in dataframe with maximum values grouped by values in another column where one of the solution has been discussed. I am using this solution to recursively find the row index with maximum quantity. However, my solution is very ugly--very procedural instead of vectorized.

Here's my dummy data:

dput(Data)

structure(list(Order_Year = c(1999, 1999, 1999, 1999, 1999, 1999, 
1999, 2000, 2000, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2002, 
2002, 2002, 2002), Ship_Year = c(1997, 1998, 1999, 2000, 2001, 
2002, NA, 1997, NA, 1997, 1998, 1999, 2000, 2001, 2002, NA, 1997, 
1998, 1999, 2000), Yen = c(202598.2, 0, 0, 0, 0, 0, 2365901.62, 
627206.75998, 531087.43, 122167.02, 143855.55, 0, 0, 0, 0, 53650.389998, 
17708416.3198, 98196.4, 31389, 0), Units = c(37, 1, 8, 5, 8, 
8, 730, 99, 91, 195, 259, 4, 1, 3, 3, 53, 3844, 142, 63, 27)), .Names = c("Order_Year", 
"Ship_Year", "Yen", "Units"), row.names = c(NA, 20L), class = "data.frame")

I want to find out the Ship_Year for which Yen and Units are maximum for a given Order_Year.

Here's what I tried:

a<-do.call("rbind", by(Data, Data$Order_Year, function(x) x[which.max(x$Yen), ]))
rownames(a)<-NULL
a$Yen<-NULL
a$Units<-NULL
#a has Ship_Year for which Yen is max for a given Order_Year
names(a)[2]<-"by.Yen" 
#Now I'd find max year by units
b<-do.call("rbind", by(Data, Data$Order_Year, function(x) x[which.max(x$Units), ]))
rownames(b)<-NULL
b$Yen<-NULL
b$Units<-NULL
#b has Ship_Year for which Units is max for a given Order_Year
names(b)[2]<-"by.Qty"
c<-a %>% left_join(b)

The expected output is:

c
  Order_Year by.Yen by.Qty
1       1999     NA     NA
2       2000   1997   1997
3       2001   1998   1998
4       2002   1997   1997

While I got the expected output, the method above is very clunky. Is there a better way to handle this?

Upvotes: 2

Views: 1323

Answers (3)

Sathish
Sathish

Reputation: 12723

using Base R

a1 <- with(df1,
           by(data    = df1,
              INDICES = Order_Year, 
              FUN     = function(x) list(Yen   = x$Ship_Year[which.max(x$Yen)],
                                         Units = x$Ship_Year[which.max(x$Units)])))

do.call("rbind", lapply(a1, function(x) data.frame(x)))
#       Yen Units
# 1999   NA    NA
# 2000 1997  1997
# 2001 1998  1998
# 2002 1997  1997

Data:

df1 <- structure(list(Order_Year = c(1999, 1999, 1999, 1999, 1999, 1999, 1999,
                                     2000, 2000, 2001, 2001, 2001, 2001, 2001,
                                     2001, 2001, 2002, 2002, 2002, 2002),
                      Ship_Year = c(1997, 1998, 1999, 2000, 2001, 2002, NA, 
                                    1997, NA, 1997, 1998, 1999, 2000, 2001, 
                                    2002, NA, 1997, 1998, 1999, 2000),
                      Yen = c(202598.2, 0, 0, 0, 0, 0, 2365901.62, 627206.75998, 
                              531087.43, 122167.02, 143855.55, 0, 0, 0, 0,
                              53650.389998, 17708416.3198, 98196.4, 31389, 0), 
                      Units = c(37, 1, 8, 5, 8, 8, 730, 99, 91, 195, 259, 4,
                                1, 3, 3, 53, 3844, 142, 63, 27)), 
                 .Names = c("Order_Year", "Ship_Year", "Yen", "Units"), 
                 row.names = c(NA, 20L),
                 class = "data.frame")

Upvotes: 3

akrun
akrun

Reputation: 887511

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(Data)), grouped by 'Order_Year', we get the index of the maximum value of 'Yen', 'Units' with match, subset the corresponding values of 'Ship_Year' based on that index to return the summarised output

library(data.table)
setDT(Data)[,.(by.Yen = Ship_Year[match(max(Yen), Yen)],
        by.Units = Ship_Year[match(max(Units), Units)]) , Order_Year]
#   Order_Year by.Yen by.Units
#1:       1999     NA       NA
#2:       2000   1997     1997
#3:       2001   1998     1998
#4:       2002   1997     1997

If there are many columns, instead of doing this separately, we can specify the columns of interest in .SDcols, grouped by 'Order_Year', loop through the Subset of Data.table (.SD) to get the index of maximum value, unlist the list output, subset the 'Ship_Year' based on that index, convert to list (as.list) and set the names of the columns to 'by.Yen' and 'by.Units'

setnames(setDT(Data)[,  as.list(Ship_Year[unlist(lapply(.SD, 
  which.max))]), Order_Year, .SDcols = c("Yen", "Units")], 
                2:3, c("by.Yen", "by.Units"))[]
#    Order_Year by.Yen by.Units
#1:       1999     NA       NA
#2:       2000   1997     1997
#3:       2001   1998     1998
#4:       2002   1997     1997

Upvotes: 2

alistaire
alistaire

Reputation: 43354

which.max works well with dplyr grouping:

library(dplyr)

Data %>% group_by(Order_Year) %>% 
    summarise(by.Yen = Ship_Year[which.max(Yen)], 
              by.Units = Ship_Year[which.max(Units)])

## # A tibble: 4 × 3
##   Order_Year by.Yen by.Units
##        <dbl>  <dbl>    <dbl>
## 1       1999     NA       NA
## 2       2000   1997     1997
## 3       2001   1998     1998
## 4       2002   1997     1997

Upvotes: 4

Related Questions