Reputation: 1611
I have the following data:
Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,12,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")
data <- data.frame(Name, Age, Group)
And I'd like to use dplyr to
(1) group the data by "Group" (2) show the min and max Age within each Group (3) show the Name of the person with the min and max ages
The following code does this:
data %>% group_by(Group) %>%
summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))],
maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])
Which works well:
Group minAge minAgeName maxAge maxAgeName
1 A 22 Sam 22 Sam
2 B 12 Sarah 58 James
3 C 17 Andrew 82 Sally
4 D 12 Mairin 67 Ray
However, I have a problem if there are multiple min or max values:
Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,31,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")
data <- data.frame(Name, Age, Group)
> data %>% group_by(Group) %>%
+ summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))],
+ maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])
Error: expecting a single value
I'm looking for two solutions:
(1) where it doesn't matter which min or max name is shown, just that one is shown (i.e., the first value found) (2) where if there are "ties" all minimum values and maximum values are shown
Please let me know if this isn't clear and thanks in advance!
Upvotes: 28
Views: 80606
Reputation: 66819
Here are some data.table
approaches, the first one borrowed from @akrun:
setDT(data)
# show one, wide format
data[,c(min=.SD[which.min(Age)],max=.SD[which.max(Age)]),by=Group]
# Group min.Name min.Age max.Name max.Age
# 1: A Sam 22 Sam 22
# 2: B Sarah 31 James 58
# 3: C Andrew 17 Sally 82
# 4: D Mairin 12 Ray 67
# show all, long format
data[,{
mina=min(Age)
maxa=max(Age)
rbind(
data.table(minmax="min",Age=mina,Name=Name[which(Age==mina)]),
data.table(minmax="max",Age=maxa,Name=Name[which(Age==maxa)])
)},by=Group]
# Group minmax Age Name
# 1: A min 22 Sam
# 2: A max 22 Sam
# 3: B min 31 Sarah
# 4: B min 31 Jim
# 5: B max 58 James
# 6: C min 17 Andrew
# 7: C max 82 Sally
# 8: D min 12 Mairin
# 9: D max 67 Ray
I think the long format is the best, since it allows you to filter with minmax
, but the code is tortured and inefficient.
Here are some arguably less good ways:
# show all, wide format (with a list column)
data[,{
mina=min(Age)
maxa=max(Age)
list(
minAge=mina,
maxAge=maxa,
minNames=list(Name[Age==mina]),
maxNames=list(Name[Age==maxa]))
},by=Group]
# Group minAge maxAge minNames maxNames
# 1: A 22 22 Sam Sam
# 2: B 31 58 Sarah,Jim James
# 3: C 17 82 Andrew Sally
# 4: D 12 67 Mairin Ray
# show all, wide format (with a string column)
# (just look at @shadow's answer)
Upvotes: 4
Reputation: 193687
I would actually recommend keeping your data in a "long" format. Here's how I would approach this:
library(dplyr)
Keeping all values when there are ties:
data %>%
group_by(Group) %>%
arrange(Age) %>% ## optional
filter(Age %in% range(Age))
# Source: local data frame [8 x 3]
# Groups: Group
#
# Name Age Group
# 1 Sam 22 A
# 2 Sarah 31 B
# 3 Jim 31 B
# 4 James 58 B
# 5 Andrew 17 C
# 6 Sally 82 C
# 7 Mairin 12 D
# 8 Ray 67 D
Keeping only one value when there are ties:
data %>%
group_by(Group) %>%
arrange(Age) %>%
slice(if (length(Age) == 1) 1 else c(1, n())) ## maybe overkill?
# Source: local data frame [7 x 3]
# Groups: Group
#
# Name Age Group
# 1 Sam 22 A
# 2 Sarah 31 B
# 3 James 58 B
# 4 Andrew 17 C
# 5 Sally 82 C
# 6 Mairin 12 D
# 7 Ray 67 D
If you really want a "wide" dataset, the basic concept would be to gather
and spread
the data, using "tidyr":
library(dplyr)
library(tidyr)
data %>%
group_by(Group) %>%
arrange(Age) %>%
slice(c(1, n())) %>%
mutate(minmax = c("min", "max")) %>%
gather(var, val, Name:Age) %>%
unite(key, minmax, var) %>%
spread(key, val)
# Source: local data frame [4 x 5]
#
# Group max_Age max_Name min_Age min_Name
# 1 A 22 Sam 22 Sam
# 2 B 58 James 31 Sarah
# 3 C 82 Sally 17 Andrew
# 4 D 67 Ray 12 Mairin
Though what wide form you would want with ties is unclear.
Upvotes: 14
Reputation: 22353
You can use which.min
and which.max
to get the first value.
data %>% group_by(Group) %>%
summarize(minAge = min(Age), minAgeName = Name[which.min(Age)],
maxAge = max(Age), maxAgeName = Name[which.max(Age)])
To get all values, use e.g. paste with an appropriate collapse
argument.
data %>% group_by(Group) %>%
summarize(minAge = min(Age), minAgeName = paste(Name[which(Age == min(Age))], collapse = ", "),
maxAge = max(Age), maxAgeName = paste(Name[which(Age == max(Age))], collapse = ", "))
Upvotes: 37