val
val

Reputation: 1699

In R, unexpected result from using group_by() and summarise() in dplyr

I don't quite understand how some of the groupings and summaries are built in R using dplyr package.

With the reproducible example below I'm trying to first group by (PN,GOT,HID) to count distinct instances of PC1. I then regroup by (PN,GOT) to sum over the distinct instances of PC1, based on the second grouping. This process seems to work for the total sums, except that for mean(TC) I get the mean of the entire data frame when I would expect to see the means by groupings of (PN,GOT). What am i missing to get those means of (PN,GOT), while not losing the sums over PC1 that I've built? I would appreciate some explanation of where I'm going wrong here.

PN<- c("Mazda","Mazda","Datsun","Hornet","Hornet","Valiant","Duster","Merc","Merc","Merc","Merc","Merc",
       "Merc","Merc","Fiat","Honda","Toyota","Toyota","Dodge","AMC","Fiat")
GOT<- c("A","A","B","C","C","A","D","B","B","B","B","B","B","B","A","D","B","B","C","E","A")
HID<- c("Mazda_H1","Mazda_H1","Datsus_H1","Hornet_H1","Hornet_H2","Valiant_H1","Duster_H1","Merc_H1","Merc_H1","Merc_H1",
        "Merc_H2","Merc_H2","Merc_H3","Merc_H4","Fiat_H1","Honda_H1","Toyota_H1","Toyota_H2","Dodge_H1","AMC_H1","Fiat_H1")
PIC<- c("BB","BB","BB","BB","AA","AA","AA","BA","BA","BA",
        "AA","BB","BB","BB","BB","AA","AA","AA","BA","BA","BA")
TC <- c(110,110,93,175,175,105,245,62,62,62,62,62,62,62,33,52,97,97,150,150,33)
Int <- c(16.46,17.02,18.61,19.44,17.02,20.22,15.84,20.00,22.90,18.30,18.90,
         17.40,17.60,18.00,19.47,18.52,19.90,20.01,16.87,17.30,18.90)
PC1<- c("", "","G1","C1","","G1","", "G1","G1","C1","C1","","","","Z1","Z1","Z1","Z1","","","G1")

df<-data.frame(PN,GOT,HID,PIC,TC,Int,PC1)

df

df%>% filter(PC1!="") %>%
  group_by(PN, GOT, HID) %>%
  summarize(new = n_distinct(PC1)) %>%
  group_by(PN, GOT) %>%
  mutate(TOT_new = sum(new),
            meanTC = mean(TC))

I think the answer I'm looking for is something looking like this:

       PN    GOT        HID   TOT_new meanTC
   <fctr> <fctr>     <fctr>   <int>  <dbl>
1  Datsun      B  Datsus_H1     1     93
2    Fiat      A    Fiat_H1     2     33
3   Honda      D   Honda_H1     1     52
4  Hornet      C  Hornet_H1     1    175
5    Merc      B    Merc_H1     3     62
6  Toyota      B  Toyota_H1     2     97
7 Valiant      A Valiant_H1     1    105

or at least this:

       PN    GOT        HID   new TOT_new meanTC
   <fctr> <fctr>     <fctr> <int>   <int>  <dbl>
1  Datsun      B  Datsus_H1     1       1     93
2    Fiat      A    Fiat_H1     2       2     33
3   Honda      D   Honda_H1     1       1     52
4  Hornet      C  Hornet_H1     1       1    175
5    Merc      B    Merc_H1     2       3     62
6    Merc      B    Merc_H2     1       3     62
7  Toyota      B  Toyota_H1     1       2     97
8  Toyota      B  Toyota_H2     1       2     97
9 Valiant      A Valiant_H1     1       1    105

Upvotes: 1

Views: 1984

Answers (3)

akuiper
akuiper

Reputation: 214957

As commented by @r2evans, the reason you get a global mean is TC column is dropped during the first summarize stage. Besides the join option suggested in the comment, you can also pass the TC column info forward in the first summary stage by calculating two intermediate variables:

df %>% filter(PC1 != "") %>%

    group_by(PN, GOT, HID) %>%
    # create two columns with the sum and length of TC in each group which you can use later
    # for average calculation
    summarize(new = n_distinct(PC1), n = n(), TC_sum = sum(TC)) %>%

    group_by(PN, GOT) %>%
    summarise(TOT_new = sum(new), meanTC = sum(TC_sum)/sum(n))

# Source: local data frame [7 x 4]
# Groups: PN [?]

#       PN    GOT TOT_new meanTC
#   <fctr> <fctr>   <int>  <dbl>
#1  Datsun      B       1     93
#2    Fiat      A       2     33
#3   Honda      D       1     52
#4  Hornet      C       1    175
#5    Merc      B       3     62
#6  Toyota      B       2     97
#7 Valiant      A       1    105

Upvotes: 2

akrun
akrun

Reputation: 887118

We can also use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), specify the logical condition in 'i' (PC1 != ""), grouped by 'PN', 'GOT', 'HID', we get the length of uniqueelements of 'PC1 ('new'), number of elements per group (.N), and sum of 'TC', then grouped by 'PN', 'GOT', we assign the sum of 'new' and ratio of sum of 'TC_sum' with sum of 'n' to 'TOT_new' and 'meanTC'. Assign the columns that are not needed to NULL

library(data.table)
setDT(df)[PC1 != "", .(new = uniqueN(PC1), n = .N, TC_sum = sum(TC)) ,.(PN, GOT, HID)
       ][, c("TOT_new", "meanTC") := .(sum(new), sum(TC_sum)/sum(n)) ,.(PN, GOT)
         ][, c("n", "TC_sum") := NULL][]
#        PN GOT        HID new TOT_new meanTC
#1:  Datsun   B  Datsus_H1   1       1     93
#2:  Hornet   C  Hornet_H1   1       1    175
#3: Valiant   A Valiant_H1   1       1    105
#4:    Merc   B    Merc_H1   2       3     62
#5:    Merc   B    Merc_H2   1       3     62
#6:    Fiat   A    Fiat_H1   2       2     33
#7:   Honda   D   Honda_H1   1       1     52
#8:  Toyota   B  Toyota_H1   1       2     97
#9:  Toyota   B  Toyota_H2   1       2     97

Upvotes: 1

val
val

Reputation: 1699

This is one way to make it work, based on some of the comments above. But it looks redundant.

df%>% filter(PC1!="") %>%
  group_by(PN, GOT, HID) %>%
  summarize(new = n_distinct(PC1),
            meanTC = mean(TC)) %>%
  group_by(PN, GOT) %>%
  mutate(TOT_new = sum(new),
         meanTC = mean(meanTC))%>%
  select(-HID)

       PN    GOT   new meanTC TOT_new
   <fctr> <fctr> <int>  <dbl>   <int>
1  Datsun      B     1     93       1
2    Fiat      A     2     33       2
3   Honda      D     1     52       1
4  Hornet      C     1    175       1
5    Merc      B     2     62       3
6    Merc      B     1     62       3
7  Toyota      B     1     97       2
8  Toyota      B     1     97       2
9 Valiant      A     1    105       1

Upvotes: 0

Related Questions