cyrusjan
cyrusjan

Reputation: 647

How to sum up the duplicated value while keep the other columns?

I have a dataset like this

    V0  V1  V2  V3   X     Y  
#1  1   A   21  31   123   12
#2  2   A   21  31   245   24
#3  3   B   22  32   234   25
#4  4   C   23  33   190   30
#5  5   C   23  33   210   20

So there are duplicated value in V1, V2 and V3; and I would like to create a dataset like this, which sum up X and Y respectively, by V1-V3

    V1  V2  V3   X     Y    V
#1  A   21  31   368   36   1,2
#2  B   22  32   234   25   3
#3  C   23  33   400   50   4,5

I was trying "aggregate" as below, but have no idea how to do it with both X and Y together, and at the same time, keep V2 and V3 as the original value, instead of sum them. And how can I put the value in V0 within the same group (V1) in another new variable.

 df.sum <- aggregate(X~V1,data=df,FUN=sum)

I try to merge the "df.sum" with "df" by "V1" afterwards, but it turns out that all the duplicated value were merged as well.

Any suggestion? Thank you very much!

Upvotes: 1

Views: 2148

Answers (2)

KFB
KFB

Reputation: 3501

Or by dplyr

library(dplyr)
df %>% group_by(V1,V2,V3) %>% summarise(X_sum=sum(X), Y_sum= sum(Y))
# Or as suggested, you could also do:
df %>% group_by(V1,V2,V3) %>% summarise_each(funs(sum))

#Source: local data frame [3 x 5]
#Groups: V1, V2
#
#  V1 V2 V3 X_sum Y_sum
#1  A 21 31   368    36
#2  B 22 32   234    25
#3  C 23 33   400    50

# data
df <- structure(list(V1 = structure(c(1L, 1L, 2L, 3L, 3L), .Label = c("A", 
"B", "C"), class = "factor"), V2 = c(21L, 21L, 22L, 23L, 23L), 
    V3 = c(31L, 31L, 32L, 33L, 33L), X = c(123L, 245L, 234L, 
    190L, 210L), Y = c(12L, 24L, 25L, 30L, 20L)), .Names = c("V1", 
"V2", "V3", "X", "Y"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5"))

Regarding the updated data, you could do something like:

df %>% group_by(V1,V2,V3) %>%
  summarise_each(funs(sum, toString), X, Y, V0) %>%
  select(-V0_sum,
         -X_toString,
         -Y_toString)
# you get
#   V1 V2 V3 X_sum Y_sum V0_toString
# 1  A 21 31   368    36        1, 2
# 2  B 22 32   234    25           3
# 3  C 23 33   400    50        4, 5

# data
df <- structure(list(V0 = 1:5, V1 = structure(c(1L, 1L, 2L, 3L, 3L), .Label = c("A", 
"B", "C"), class = "factor"), V2 = c(21L, 21L, 22L, 23L, 23L), 
    V3 = c(31L, 31L, 32L, 33L, 33L), X = c(123L, 245L, 234L, 
    190L, 210L), Y = c(12L, 24L, 25L, 30L, 20L)), .Names = c("V0", 
"V1", "V2", "V3", "X", "Y"), class = "data.frame", row.names = c(NA, 
-5L))

Upvotes: 4

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193677

You're on the right track. Just do:

aggregate(. ~ V1 + V2 + V2, mydf, sum)
#   V1 V2 V3   X  Y
# 1  A 21 62 368 36
# 2  B 22 32 234 25
# 3  C 23 66 400 50

You can also do this in many other ways. For example, here's an approach using "data.table":

library(data.table)
as.data.table(mydf)[, lapply(.SD, sum), by = list(V1, V2, V3)]

Upvotes: 3

Related Questions