Reputation: 1699
I don't quite understand how some of the groupings and summaries are built in R using dplyr package.
With the reproducible example below I'm trying to first group by (PN,GOT,HID) to count distinct instances of PC1. I then regroup by (PN,GOT) to sum over the distinct instances of PC1, based on the second grouping. This process seems to work for the total sums, except that for mean(TC) I get the mean of the entire data frame when I would expect to see the means by groupings of (PN,GOT). What am i missing to get those means of (PN,GOT), while not losing the sums over PC1 that I've built? I would appreciate some explanation of where I'm going wrong here.
PN<- c("Mazda","Mazda","Datsun","Hornet","Hornet","Valiant","Duster","Merc","Merc","Merc","Merc","Merc",
"Merc","Merc","Fiat","Honda","Toyota","Toyota","Dodge","AMC","Fiat")
GOT<- c("A","A","B","C","C","A","D","B","B","B","B","B","B","B","A","D","B","B","C","E","A")
HID<- c("Mazda_H1","Mazda_H1","Datsus_H1","Hornet_H1","Hornet_H2","Valiant_H1","Duster_H1","Merc_H1","Merc_H1","Merc_H1",
"Merc_H2","Merc_H2","Merc_H3","Merc_H4","Fiat_H1","Honda_H1","Toyota_H1","Toyota_H2","Dodge_H1","AMC_H1","Fiat_H1")
PIC<- c("BB","BB","BB","BB","AA","AA","AA","BA","BA","BA",
"AA","BB","BB","BB","BB","AA","AA","AA","BA","BA","BA")
TC <- c(110,110,93,175,175,105,245,62,62,62,62,62,62,62,33,52,97,97,150,150,33)
Int <- c(16.46,17.02,18.61,19.44,17.02,20.22,15.84,20.00,22.90,18.30,18.90,
17.40,17.60,18.00,19.47,18.52,19.90,20.01,16.87,17.30,18.90)
PC1<- c("", "","G1","C1","","G1","", "G1","G1","C1","C1","","","","Z1","Z1","Z1","Z1","","","G1")
df<-data.frame(PN,GOT,HID,PIC,TC,Int,PC1)
df
df%>% filter(PC1!="") %>%
group_by(PN, GOT, HID) %>%
summarize(new = n_distinct(PC1)) %>%
group_by(PN, GOT) %>%
mutate(TOT_new = sum(new),
meanTC = mean(TC))
I think the answer I'm looking for is something looking like this:
PN GOT HID TOT_new meanTC
<fctr> <fctr> <fctr> <int> <dbl>
1 Datsun B Datsus_H1 1 93
2 Fiat A Fiat_H1 2 33
3 Honda D Honda_H1 1 52
4 Hornet C Hornet_H1 1 175
5 Merc B Merc_H1 3 62
6 Toyota B Toyota_H1 2 97
7 Valiant A Valiant_H1 1 105
or at least this:
PN GOT HID new TOT_new meanTC
<fctr> <fctr> <fctr> <int> <int> <dbl>
1 Datsun B Datsus_H1 1 1 93
2 Fiat A Fiat_H1 2 2 33
3 Honda D Honda_H1 1 1 52
4 Hornet C Hornet_H1 1 1 175
5 Merc B Merc_H1 2 3 62
6 Merc B Merc_H2 1 3 62
7 Toyota B Toyota_H1 1 2 97
8 Toyota B Toyota_H2 1 2 97
9 Valiant A Valiant_H1 1 1 105
Upvotes: 1
Views: 1984
Reputation: 214957
As commented by @r2evans, the reason you get a global mean is TC column is dropped during the first summarize stage. Besides the join
option suggested in the comment, you can also pass the TC
column info forward in the first summary stage by calculating two intermediate variables:
df %>% filter(PC1 != "") %>%
group_by(PN, GOT, HID) %>%
# create two columns with the sum and length of TC in each group which you can use later
# for average calculation
summarize(new = n_distinct(PC1), n = n(), TC_sum = sum(TC)) %>%
group_by(PN, GOT) %>%
summarise(TOT_new = sum(new), meanTC = sum(TC_sum)/sum(n))
# Source: local data frame [7 x 4]
# Groups: PN [?]
# PN GOT TOT_new meanTC
# <fctr> <fctr> <int> <dbl>
#1 Datsun B 1 93
#2 Fiat A 2 33
#3 Honda D 1 52
#4 Hornet C 1 175
#5 Merc B 3 62
#6 Toyota B 2 97
#7 Valiant A 1 105
Upvotes: 2
Reputation: 887118
We can also use data.table
. Convert the 'data.frame' to 'data.table' (setDT(df)
), specify the logical condition in 'i' (PC1 != ""
), grouped by 'PN', 'GOT', 'HID', we get the length
of unique
elements of 'PC1 ('new'), number of elements per group (.N
), and sum
of 'TC', then grouped by 'PN', 'GOT', we assign the sum
of 'new' and ratio of sum
of 'TC_sum' with sum
of 'n' to 'TOT_new' and 'meanTC'. Assign the columns that are not needed to NULL
library(data.table)
setDT(df)[PC1 != "", .(new = uniqueN(PC1), n = .N, TC_sum = sum(TC)) ,.(PN, GOT, HID)
][, c("TOT_new", "meanTC") := .(sum(new), sum(TC_sum)/sum(n)) ,.(PN, GOT)
][, c("n", "TC_sum") := NULL][]
# PN GOT HID new TOT_new meanTC
#1: Datsun B Datsus_H1 1 1 93
#2: Hornet C Hornet_H1 1 1 175
#3: Valiant A Valiant_H1 1 1 105
#4: Merc B Merc_H1 2 3 62
#5: Merc B Merc_H2 1 3 62
#6: Fiat A Fiat_H1 2 2 33
#7: Honda D Honda_H1 1 1 52
#8: Toyota B Toyota_H1 1 2 97
#9: Toyota B Toyota_H2 1 2 97
Upvotes: 1
Reputation: 1699
This is one way to make it work, based on some of the comments above. But it looks redundant.
df%>% filter(PC1!="") %>%
group_by(PN, GOT, HID) %>%
summarize(new = n_distinct(PC1),
meanTC = mean(TC)) %>%
group_by(PN, GOT) %>%
mutate(TOT_new = sum(new),
meanTC = mean(meanTC))%>%
select(-HID)
PN GOT new meanTC TOT_new
<fctr> <fctr> <int> <dbl> <int>
1 Datsun B 1 93 1
2 Fiat A 2 33 2
3 Honda D 1 52 1
4 Hornet C 1 175 1
5 Merc B 2 62 3
6 Merc B 1 62 3
7 Toyota B 1 97 2
8 Toyota B 1 97 2
9 Valiant A 1 105 1
Upvotes: 0