Allen

Reputation: 712

R: Get sum of a column for a user's earlier questions

I have a simple data frame in R:

df1 <- data.frame(
  questionID = c(1, 1, 3, 4, 5, 5),
  userID = c(101, 101, 102, 101, 102, 101),
  Value = c(10, 20, 30, 40, 50, 10))

The basic idea is to add a column that gives, for each row, the sum of Value over the questions that user asked before (i.e. lower-numbered questions).
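
For the sample data above, the new column (call it Expert) would look like this, where each row sums Value over that user's strictly lower-numbered questions:

#   questionID userID Value Expert
# 1          1    101    10      0   # no earlier questions for user 101
# 2          1    101    20      0
# 3          3    102    30      0   # no earlier questions for user 102
# 4          4    101    40     30   # 10 + 20
# 5          5    102    50     30   # 30
# 6          5    101    10     70   # 10 + 20 + 40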

I tried using this function (after first trying to pipe into sum, which just gave the non-numeric errors everybody seems to face):

library(dplyr)

f2 <- function(x) {
  # sum Value over this user's earlier (lower-numbered) questions
  Value_out <- filter(df1, questionID < x['questionID'] & userID == x['userID']) %>%
    select(Value) %>%
    summarize_if(is.numeric, sum, na.rm = TRUE)
}

out <- mutate(df1, Expert = apply(df1, 1, f2))

While this works if you print it out, the Expert column is saved as a list of data frames. All I want is an int or numeric with the sum of Value. Is there any way to do this? By the way, yes, I've looked all over for ways to do this, but most answers just summarize the whole column, which won't work for me.
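
Presumably I could flatten the list column after the fact with something like this (each element of Expert ends up as a 1 x 1 data frame), but that seems like a workaround rather than the right way to build the column:

# pull the single numeric value out of each 1 x 1 data frame
out$Expert <- vapply(out$Expert, function(d) d$Value, numeric(1))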

Upvotes: 0

Views: 109

Answers (2)

Ronak Shah

Reputation: 388807

Using dplyr and purrr::map_dbl, one approach is to group_by userID and, for each row, sum Value over every questionID that is less than the current one.

library(dplyr)

df1 %>%
  group_by(userID) %>%
  mutate(Expert = purrr::map_dbl(questionID, ~sum(Value[questionID < .x])))

#  questionID userID Value Expert
#       <dbl>  <dbl> <dbl>  <dbl>
#1          1    101    10      0
#2          1    101    20      0
#3          3    102    30      0
#4          4    101    40     30
#5          5    102    50     30
#6          5    101    10     70
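
If you want a plain, ungrouped data frame back (to assign to df1, say), an ungroup() at the end should take care of it; a sketch of the same approach:

library(dplyr)
library(purrr)

df1 %>%
  group_by(userID) %>%
  mutate(Expert = map_dbl(questionID, ~ sum(Value[questionID < .x]))) %>%
  ungroup()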

Upvotes: 1

ander2ed

Reputation: 1338

I think I would avoid writing my own function altogether and use data.table on this one. You can do what you want in just a couple of lines, and I wouldn't be surprised if there were a way to golf it down even further.

Given the same data, we first create a data.table object:

library(data.table)
dt <- data.table(
  questionID = c(1, 1, 3, 4, 5, 5),
  userID = c(101, 101, 102, 101, 102, 101),
  Value = c(10, 20, 30, 40, 50, 10))

Next, we shift our values by 1 (lag) within each userID:

dt[, lastVal := shift(Value, n = 1, fill = 0), by = .(userID)]
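
At this point dt should look something like this (the same lastVal column shows up in the final output further down):

dt
#    questionID userID Value lastVal
# 1:          1    101    10       0
# 2:          1    101    20      10
# 3:          3    102    30       0
# 4:          4    101    40      20
# 5:          5    102    50      30
# 6:          5    101    10      40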

And finally, we cumsum those by userID, then, for rows that share the same userID and questionID, replace Expert with the minimum within that group so duplicate questions don't count each other. The first question for each user ends up at 0 because we used fill = 0 in shift above before the cumsum:

dt[, Expert := cumsum(lastVal), by = .(userID)
   ][, Expert := min(Expert), by = .(userID, questionID)]

So, putting that all together, we have:

library(data.table)
dt <- data.table(
  questionID = c(1, 1, 3, 4, 5, 5),
  userID = c(101, 101, 102, 101, 102, 101),
  Value = c(10, 20, 30, 40, 50, 10))
dt[, lastVal := shift(Value, n = 1, fill = 0), by = .(userID)]
dt[, Expert := cumsum(lastVal), by = .(userID)
   ][, Expert := min(Expert), by = .(userID, questionID)]

dt

   questionID userID Value lastVal Expert
1:          1    101    10       0      0
2:          1    101    20      10      0
3:          3    102    30       0      0
4:          4    101    40      20     30
5:          5    102    50      30     30
6:          5    101    10      40     70
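
On the golfing note: folding the shift into the cumsum skips the lastVal helper column entirely. A sketch of that, assuming the same dt as above:

dt[, Expert := cumsum(shift(Value, fill = 0)), by = userID
   ][, Expert := min(Expert), by = .(userID, questionID)]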

Upvotes: 1
