paulburg
paulburg

Reputation: 101

Sum up rows according to specific values

That's my data frame

Colour = c("red",   "blue", "red",  "blue", "yellow",   "green",    "red",  "blue", "green",    "red",  "yellow",   "blue")
Volume  = c(46,46,57,57,57,57,99,99,99,111,111,122)
Cases   = c(7,2,4,2,3,5,1,2,3,2,4,1)
df = data.frame(Colour, Volume, Cases)

I want to sum up Cases if Colour is "red" OR "blue" but if Volume is identical. Those colours which are not specified should be kept. If red and blue can't be summed up because they differ in Volume then they should also be kept

The reult should look like that:

Colour = c("red_or_blue","red_or_blue","yellow","green","red_or_blue","green","red","yellow","blue")
Volume  = c(46,57,57,57,99,99,111,111,122)
Cases   = c(9,6,3,5,3,3,2,4,1)
df_agg = data.frame(Colour, Volume, Cases)

I've figured out a way where I create a further column which assigns an "red_or_blue" to the row with red or blue and an x for the remaining rows. I then used aggregate:

df$test = ifelse(df$Colour %in% c("red", "blue"),"red_or_blue","x")
df_agg = aggregate(df$Cases, list(df$Volume, df$test), sum)

It works but i found this a bit cumbersome. Is there a more handy way that would skip creating an extra column? In future I need to sum up cases for red/blue AND for Volume 57/99. Having the extra column appears to make it a bit more tricky.

Also, I didn't manage to get the original colour being taken over if it's not red nor blue. I tried it this way but it woudln't work:

df$test = ifelse(df$Colour %in% c("red", "blue"),"red_or_blue",df$Colour)

Cheers, Paul

Upvotes: 2

Views: 1786

Answers (2)

Carl Witthoft
Carl Witthoft

Reputation: 21492

I think if you follow @mrdwab's approach, you can use sapply on each "split volume" to do

df$Cases <- sum(df[(df$Colour =='blue' | df$Colour == 'red'),][,3])

to get the number of cases, and

df$Colour[(df$Colour =='blue' | df$Colour == 'red')] <- 'readandblue'

to change the colornames. I'm also willing to bet there's a 2-line solution using ddply but I'm not an expert w/ that tool (yet).

Upvotes: 0

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193507

Here's a way sticking in base R (but probably not the most efficient way....)

  1. Split your data into groups by Volume

    temp = split(df, df$Volume)
    
  2. Create a quick function to change the values for "red" and "blue" only in groups where there is a "red" AND a "blue" present.

    red.and.blue = function(x) {
      if (sum(c("red", "blue") %in% x$Colour) > 1) {
        x$Colour = gsub("red|blue", "red-and-blue", x$Colour)
      } else {
        x$Colour = as.character(x$Colour)
      }
      x
    }
    
  3. Use that function on your temp object that you created in Step 1:

    temp = lapply(temp, red.and.blue)
    
  4. Use aggregate() to perform the aggregation you need to do. Specify the names in the aggregate() arguments so that you maintain your original column names.

    temp = lapply(temp, function(x) aggregate(list(Cases = x$Cases), 
                                              list(Colour = x$Colour, 
                                                   Volume = x$Volume), sum))
    
  5. Put it back all into a data.frame(). Don't forget to assign a name if you want to store it as is.

    do.call(rbind, temp)
    #             Colour Volume Cases
    # 46    red-and-blue     46     9
    # 57.1         green     57     5
    # 57.2  red-and-blue     57     6
    # 57.3        yellow     57     3
    # 99.1         green     99     3
    # 99.2  red-and-blue     99     3
    # 111.1          red    111     2
    # 111.2       yellow    111     4
    # 122           blue    122     1
    

Upvotes: 1

Related Questions