N.Varela
N.Varela

Reputation: 910

R: How to aggregate (and sum up) rows in df according to multiple column criteria and keep previous order?

I want to aggregate rows (and sum up values) in the following data example:

    df <- data.frame(from=c("A" ,"A", "A", "C", "C", "D", "A"),
                 to=c("B", "B", "B", "A", "A", "B", "D"),
                 values=c(5,6,2,10,2,6,3),
                 product=c("x","x", "x", "y", "z", "w", "w"),
                 year=c(1990,1991,1991,1990,1990,1991,1992))

> df
  from to values product year
1    A  B      5       x 1990
2    A  B      6       x 1991
3    A  B      2       x 1991
4    C  A     10       y 1990
5    C  A      2       z 1990
6    D  B      6       w 1991
7    A  D      3       w 1992

All rows containing same values/characters for the columns from, to, product and year should be aggregated to one row and the values in the values column should be summed up.

I tried the following code:

aggregate(values~from+to+product+year, df, FUN=sum)

and

ddply(df_id, c("from", "to", "product", "year"), numcolwise(sum))

these codes worked well. However, both changed the order of rows (also columns which is less important) see below:

for aggregate:
      from to product year values
    1    A  B       x 1990      5
    2    C  A       y 1990     10
    3    C  A       z 1990      2
    4    D  B       w 1991      6
    5    A  B       x 1991      8
    6    A  D       w 1992      3

and for ddply:

       from    to product year values
    1     C     A       y 1990     10
    2     C     A       z 1990      2
    3     A     B       x 1990      5
    4     A     B       x 1991      8
    5     A     D       w 1992      3
    6     D     B       w 1991      6

the expected result should look like:

  from to values product year
1    A  B      5       x 1990
2    A  B      8       x 1991
3    C  A     10       y 1990
4    C  A      2       z 1990
5    D  B      6       w 1991
6    A  D      3       w 1992

Any ideas how to solve this order problem (at least for rows)? Thanks

Upvotes: 2

Views: 884

Answers (2)

Frank
Frank

Reputation: 66819

The data.table package retains original by-group ordering by default:

library(data.table)

setDT(df)[, .(v = sum(values)), by=.(from,to,product,year)] 

#    from to product year  v
# 1:    A  B       x 1990  5
# 2:    A  B       x 1991  8
# 3:    C  A       y 1990 10
# 4:    C  A       z 1990  2
# 5:    D  B       w 1991  6
# 6:    A  D       w 1992  3

Only with keyby= instead of by= would it sort groups.

Upvotes: 4

akrun
akrun

Reputation: 887541

Here is one way with dplyr. We replace the 'values' by the sum after grouping by the variables and get the distinct rows.

library(dplyr)
df %>% 
   group_by(from, to, product, year) %>% 
   mutate(values=sum(values)) %>% 
   distinct
    from     to values product  year
#  (fctr) (fctr)  (dbl)  (fctr) (dbl)
#1      A      B      5       x  1990
#2      A      B      8       x  1991
#3      C      A     10       y  1990
#4      C      A      2       z  1990
#5      D      B      6       w  1991
#6      A      D      3       w  1992

EDIT: replaced unique with distinct based on @Steven Beaupre's comment.

Upvotes: 2

Related Questions