DCRubyHound
DCRubyHound

Reputation: 343

sum by group, but exclude values below designated id

I'm trying to go from this:

#Starting df
name = c("tom", "tom", "tom", "chris", "chris", "chris", "jen", "jen", "jen") 
value = c(2,10,"end",45,"end",13,6,"end",13) 
df = data.frame(name,value) 

or this (which uses NA as the cutoff)

#Starting df
name = c("tom", "tom", "tom", "chris", "chris", "chris", "jen", "jen", "jen") 
value = c(2,10,NA,45,NA,13,6,NA,13) 
starting_df = data.frame(name,value) 

To this:

#Ending df
name = c("tom", "tom", "tom", "chris", "chris", "chris", "jen", "jen", "jen") 
value = c(12,12,12,45,45,45,6,6,6) 
ending_df = data.frame(name,value) 

The idea here is to sum by group(name in this case), which I can easily do using group_by function from dplyr, but I need to remove all values beneath NAs or text that states end. The values beneath these cutoffs cannot be included in my sum. I've been working on this all night and am out of ideas. Thanks for your help.

Upvotes: 4

Views: 126

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 389235

A base R option would be using ave where for each group (name) we find the index where the first NA occurs using which.max and sum values until that index.

starting_df$value <- ave(starting_df$value, starting_df$name, FUN = function(x) 
                                sum(x[1:which.max(is.na(x)) - 1]))

starting_df

#   name value
#1   tom    12
#2   tom    12
#3   tom    12
#4 chris    45
#5 chris    45
#6 chris    45
#7   jen     6
#8   jen     6
#9   jen     6

As @thelatemail commented, we can also do this using cumsum. We select the maximum value of cumsum from each group ignoring the NAs.

starting_df$value <- ave(starting_df$value, starting_df$name, FUN = function(x) 
                              max(cumsum(x), na.rm = TRUE))

Upvotes: 2

akrun
akrun

Reputation: 887851

After grouping by 'name', convert the 'value' to numeric (in case if the column is character, if it is factor, use as.numeric(as.character(value))) so that the non-numeric elements get converted to NA, then get the index of the first NA element with which and is.na, and find the sum of 'value' until that row

starting_df %>%
    group_by(name) %>%
    mutate(value = as.numeric(value), 
           value = sum(value[seq_len(which(is.na(value))[1])], na.rm = TRUE))
# A tibble: 9 x 2
# Groups:   name [3]
#    name value
#  <fctr> <dbl>
#1    tom    12
#2    tom    12
#3    tom    12
#4  chris    45
#5  chris    45
#6  chris    45
#7    jen     6
#8    jen     6
#9    jen     6

Upvotes: 2

Related Questions