Reputation: 1015

Applying group_by and summarise on data while keeping all the columns' info

I have a large dataset with 22000 rows and 25 columns. I am trying to group my dataset based on one of the columns and take the min value of the other column based on the grouped dataset. However, the problem is that it only gives me two columns containing the grouped column and the column having the min value... but I need all the information of other columns related to the rows with the min values. Here is a simple example just to make it reproducible:

    data<- data.frame(a=1:10, b=c("a","a","a","b","b","c","c","d","d","d"), c=c(1.2, 2.2, 2.4, 1.7, 2.7, 3.1, 3.2, 4.2, 3.3, 2.2), d= c("small", "med", "larg", "larg", "larg", "med", "small", "small", "small", "med"))

    d<- data %>%
    group_by(b) %>%
    summarise(min_values= min(c))
    d
    b min_values
    1 a        1.2
    2 b        1.7
    3 c        3.1
    4 d        2.2

So, I need to have also the information related to columns a and d, however, since I have duplications in the values in column c I cannot merge them based on the min_value column... I was wondering if there is any way to keep other columns' information when we are using dplyr package.

I have found some explanation here "dplyr: group_by, subset and summarise" and here "Finding percentage in a sub-group using group_by and summarise" but none of the addresses my problem.

Upvotes: 95

Answers (4)

Maël

Reputation: 51994

With dplyr 1.1.0, you can use .by in mutate, summarize, filter and slice to do temporary grouping. With mutate, all rows and columns are kept:

data %>% 
  mutate(min_values = min(c), .by = b)

With filter, or slice, rows are summarized and all columns are kept:

data %>% 
  slice_min(c, .by = b)

data %>% 
  filter(c = min(c), .by = b)

Upvotes: 2

mpalanco

Reputation: 13570

Using sqldf:

library(sqldf)
 # Two options:
sqldf('SELECT * FROM data GROUP BY b HAVING min(c)')
sqldf('SELECT a, b, min(c) min, d FROM data GROUP BY b')

Output:

   a b   c     d
1  1 a 1.2 small
2  4 b 1.7  larg
3  6 c 3.1   med
4 10 d 2.2   med

Upvotes: 3

talat

Reputation: 70266

Here are two options using a) filter and b) slice from dplyr. In this case there are no duplicated minimum values in column c for any of the groups and so the results of a) and b) are the same. If there were duplicated minima, approach a) would return each minima per group while b) would only return one minimum (the first) in each group.

> data %>% group_by(b) %>% filter(c == min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med

Or similarly

> data %>% group_by(b) %>% filter(min_rank(c) == 1L)
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med

> data %>% group_by(b) %>% slice(which.min(c))
#Source: local data frame [4 x 4]
#Groups: b
#
#   a b   c     d
#1  1 a 1.2 small
#2  4 b 1.7  larg
#3  6 c 3.1   med
#4 10 d 2.2   med

Upvotes: 63

bergant

Reputation: 7232

You can use group_by without summarize:

data %>%
  group_by(b) %>%
  mutate(min_values = min(c)) %>%
  ungroup()

Upvotes: 75

Applying group_by and summarise on data while keeping all the columns&#39; info

Answers (4)

Related Questions

Applying group_by and summarise on data while keeping all the columns' info