Reputation: 1015
I have a large dataset with 22000 rows and 25 columns. I am trying to group my dataset based on one of the columns and take the min value of the other column based on the grouped dataset. However, the problem is that it only gives me two columns containing the grouped column and the column having the min value... but I need all the information of other columns related to the rows with the min values. Here is a simple example just to make it reproducible:
data<- data.frame(a=1:10, b=c("a","a","a","b","b","c","c","d","d","d"), c=c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2), d= c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med"))
d<- data %>%
group_by(b) %>%
summarise(min_values= min(c))
d
b min_values
1 a 1.2
2 b 1.7
3 c 3.1
4 d 2.2
So, I need to have also the information related to columns a and d, however, since I have duplications in the values in column c I cannot merge them based on the min_value column... I was wondering if there is any way to keep other columns' information when we are using dplyr package.
I have found some explanation here "dplyr: group_by, subset and summarise" and here "Finding percentage in a sub-group using group_by and summarise" but none of the addresses my problem.
Upvotes: 95
Views: 133543
Reputation: 51994
With dplyr 1.1.0
, you can use .by
in mutate
, summarize
, filter
and slice
to do temporary grouping. With mutate
, all rows and columns are kept:
data %>%
mutate(min_values = min(c), .by = b)
With filter
, or slice
, rows are summarized and all columns are kept:
data %>%
slice_min(c, .by = b)
data %>%
filter(c = min(c), .by = b)
Upvotes: 2
Reputation: 13570
Using sqldf
:
library(sqldf)
# Two options:
sqldf('SELECT * FROM data GROUP BY b HAVING min(c)')
sqldf('SELECT a, b, min(c) min, d FROM data GROUP BY b')
Output:
a b c d
1 1 a 1.2 small
2 4 b 1.7 larg
3 6 c 3.1 med
4 10 d 2.2 med
Upvotes: 3
Reputation: 70266
Here are two options using a) filter
and b) slice
from dplyr. In this case there are no duplicated minimum values in column c
for any of the groups and so the results of a) and b) are the same. If there were duplicated minima, approach a) would return each minima per group while b) would only return one minimum (the first) in each group.
a)
> data %>% group_by(b) %>% filter(c == min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
# a b c d
#1 1 a 1.2 small
#2 4 b 1.7 larg
#3 6 c 3.1 med
#4 10 d 2.2 med
Or similarly
> data %>% group_by(b) %>% filter(min_rank(c) == 1L)
#Source: local data frame [4 x 4]
#Groups: b
#
# a b c d
#1 1 a 1.2 small
#2 4 b 1.7 larg
#3 6 c 3.1 med
#4 10 d 2.2 med
b)
> data %>% group_by(b) %>% slice(which.min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
# a b c d
#1 1 a 1.2 small
#2 4 b 1.7 larg
#3 6 c 3.1 med
#4 10 d 2.2 med
Upvotes: 63
Reputation: 7232
You can use group_by
without summarize
:
data %>%
group_by(b) %>%
mutate(min_values = min(c)) %>%
ungroup()
Upvotes: 75