user2205916
user2205916

Reputation: 3456

R & dplyr: how to include more columns from the original data frame/data table after summarise()?

Phrased differently, how do you aggregate one column (e.g. the column) while keeping another the same (e.g. the location)?

This MWE illustrates my problem. How would I add back in the location column after doing a summarise()? Is there some solution that involves "going up a level," before the summarise() so that I can maintain the original columns?

test <- as.data.table(data.frame(event_id = c("A","B","A","A","B"),
                                 income = c(1,2,3,4,5),
                                 location = c("PlaceX","PlaceY","PlaceX","PlaceX","PlaceY")))

test

   event_id income location
1:        A      1   PlaceX
2:        B      2   PlaceY
3:        A      3   PlaceX
4:        A      4   PlaceX
5:        B      5   PlaceY

test %>%
  group_by(event_id) %>%
  summarise(mean_inc = mean(income))

Source: local data table [2 x 2]

  event_id mean_inc
    (fctr)    (dbl)
1        A 2.666667
2        B 3.500000

The following doesn't work:

test %>%
  group_by(event_id) %>%
  summarise(mean_inc = mean(income),
  location = location)

Source: local data table [5 x 3]

  event_id mean_inc location
    (fctr)    (dbl)   (fctr)
1        A 2.666667   PlaceX
2        A 2.666667   PlaceX
3        A 2.666667   PlaceX
4        B 3.500000   PlaceY
5        B 3.500000   PlaceY

My desired output is:

Source: local data table [2 x 3]

  event_id location mean_inc
    (fctr)   (fctr)    (dbl)
1        A   PlaceX 2.666667
2        B   PlaceY 3.500000

Upvotes: 1

Views: 690

Answers (2)

aosmith
aosmith

Reputation: 36076

An option might be to use mutate and then pull out just one value per group via distinct.

The usefulness of this would depend on the real use case: this seems most useful if your new variable has the same name as the original variable it summarizes. Otherwise you would end up having your original, unsummarized variable in the final dataset.

distinct works here because the object is still grouped.

test %>% 
    group_by(event_id) %>%
    mutate(income = mean(income)) %>%
    distinct()

Source: local data table [2 x 3]

  event_id   income location
    (fctr)    (dbl)   (fctr)
1        A 2.666667   PlaceX
2        B 3.500000   PlaceY

In dplyr_0.4.3.9000 you would need .keep_all = TRUE in distinct.

Upvotes: 1

IRTFM
IRTFM

Reputation: 263342

I hope I have understood your desires. Do an inner_join to recover the missing columns (assuming they are matched 1-1 with the group_by argument):

 newtest <- test %>%
   group_by(event_id) %>%
   summarise(mean_inc = mean(income)) %>% inner_join(test[-(1:2)])
#Joining by: "event_id"
 newtest
#-----------------
Source: local data table [3 x 4]

  event_id mean_inc income location
    (fctr)    (dbl)  (dbl)   (fctr)
1        A 2.666667      3   PlaceX
2        A 2.666667      4   PlaceX
3        B 3.500000      5   PlaceY

It's also possible you wanted to match on both event_id and location:

  test %>%
   group_by(event_id,location) %>%
   summarise(mean_inc = mean(income))
#---------
#Source: local data table [2 x 3]
#Groups: event_id

  event_id location mean_inc
    (fctr)   (fctr)    (dbl)
1        A   PlaceX 2.666667
2        B   PlaceY 3.500000

Upvotes: 1

Related Questions