Reputation: 4298
I found this thread Find rows in dataframe with maximum values grouped by values in another column where one of the solution has been discussed. I am using this solution to recursively find the row index with maximum quantity. However, my solution is very ugly--very procedural instead of vectorized.
Here's my dummy data:
dput(Data)
structure(list(Order_Year = c(1999, 1999, 1999, 1999, 1999, 1999,
1999, 2000, 2000, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2002,
2002, 2002, 2002), Ship_Year = c(1997, 1998, 1999, 2000, 2001,
2002, NA, 1997, NA, 1997, 1998, 1999, 2000, 2001, 2002, NA, 1997,
1998, 1999, 2000), Yen = c(202598.2, 0, 0, 0, 0, 0, 2365901.62,
627206.75998, 531087.43, 122167.02, 143855.55, 0, 0, 0, 0, 53650.389998,
17708416.3198, 98196.4, 31389, 0), Units = c(37, 1, 8, 5, 8,
8, 730, 99, 91, 195, 259, 4, 1, 3, 3, 53, 3844, 142, 63, 27)), .Names = c("Order_Year",
"Ship_Year", "Yen", "Units"), row.names = c(NA, 20L), class = "data.frame")
I want to find out the Ship_Year
for which Yen
and Units
are maximum for a given Order_Year
.
Here's what I tried:
a<-do.call("rbind", by(Data, Data$Order_Year, function(x) x[which.max(x$Yen), ]))
rownames(a)<-NULL
a$Yen<-NULL
a$Units<-NULL
#a has Ship_Year for which Yen is max for a given Order_Year
names(a)[2]<-"by.Yen"
#Now I'd find max year by units
b<-do.call("rbind", by(Data, Data$Order_Year, function(x) x[which.max(x$Units), ]))
rownames(b)<-NULL
b$Yen<-NULL
b$Units<-NULL
#b has Ship_Year for which Units is max for a given Order_Year
names(b)[2]<-"by.Qty"
c<-a %>% left_join(b)
The expected output is:
c
Order_Year by.Yen by.Qty
1 1999 NA NA
2 2000 1997 1997
3 2001 1998 1998
4 2002 1997 1997
While I got the expected output, the method above is very clunky. Is there a better way to handle this?
Upvotes: 2
Views: 1323
Reputation: 12723
using Base R
a1 <- with(df1,
by(data = df1,
INDICES = Order_Year,
FUN = function(x) list(Yen = x$Ship_Year[which.max(x$Yen)],
Units = x$Ship_Year[which.max(x$Units)])))
do.call("rbind", lapply(a1, function(x) data.frame(x)))
# Yen Units
# 1999 NA NA
# 2000 1997 1997
# 2001 1998 1998
# 2002 1997 1997
Data:
df1 <- structure(list(Order_Year = c(1999, 1999, 1999, 1999, 1999, 1999, 1999,
2000, 2000, 2001, 2001, 2001, 2001, 2001,
2001, 2001, 2002, 2002, 2002, 2002),
Ship_Year = c(1997, 1998, 1999, 2000, 2001, 2002, NA,
1997, NA, 1997, 1998, 1999, 2000, 2001,
2002, NA, 1997, 1998, 1999, 2000),
Yen = c(202598.2, 0, 0, 0, 0, 0, 2365901.62, 627206.75998,
531087.43, 122167.02, 143855.55, 0, 0, 0, 0,
53650.389998, 17708416.3198, 98196.4, 31389, 0),
Units = c(37, 1, 8, 5, 8, 8, 730, 99, 91, 195, 259, 4,
1, 3, 3, 53, 3844, 142, 63, 27)),
.Names = c("Order_Year", "Ship_Year", "Yen", "Units"),
row.names = c(NA, 20L),
class = "data.frame")
Upvotes: 3
Reputation: 887511
We can use data.table
. Convert the 'data.frame' to 'data.table' (setDT(Data)
), grouped by 'Order_Year', we get the index of the maximum value of 'Yen', 'Units' with match
, subset the corresponding values of 'Ship_Year' based on that index to return the summarised output
library(data.table)
setDT(Data)[,.(by.Yen = Ship_Year[match(max(Yen), Yen)],
by.Units = Ship_Year[match(max(Units), Units)]) , Order_Year]
# Order_Year by.Yen by.Units
#1: 1999 NA NA
#2: 2000 1997 1997
#3: 2001 1998 1998
#4: 2002 1997 1997
If there are many columns, instead of doing this separately, we can specify the columns of interest in .SDcols
, grouped by 'Order_Year', loop through the Subset of Data.table (.SD
) to get the index of maximum value, unlist
the list
output, subset the 'Ship_Year' based on that index, convert to list
(as.list
) and set the names of the columns to 'by.Yen' and 'by.Units'
setnames(setDT(Data)[, as.list(Ship_Year[unlist(lapply(.SD,
which.max))]), Order_Year, .SDcols = c("Yen", "Units")],
2:3, c("by.Yen", "by.Units"))[]
# Order_Year by.Yen by.Units
#1: 1999 NA NA
#2: 2000 1997 1997
#3: 2001 1998 1998
#4: 2002 1997 1997
Upvotes: 2
Reputation: 43354
which.max
works well with dplyr grouping:
library(dplyr)
Data %>% group_by(Order_Year) %>%
summarise(by.Yen = Ship_Year[which.max(Yen)],
by.Units = Ship_Year[which.max(Units)])
## # A tibble: 4 × 3
## Order_Year by.Yen by.Units
## <dbl> <dbl> <dbl>
## 1 1999 NA NA
## 2 2000 1997 1997
## 3 2001 1998 1998
## 4 2002 1997 1997
Upvotes: 4