vahab

Reputation: 327

How to pull out values corresponding to a random selection and get the cumulative summation of them?

Let's say I have a data frame with two columns for now:

df<- data.frame(scores_set1=c(32,45,65,96,45,23,23,14),
scores_set2=c(32,40,60,98,21,23,21,63))

I want to randomly select some rows:

selected_indeces<- sample(c(1:8), 4, replace = FALSE)

Now I want to add up the values at selected_indeces sequentially: for the first selected index I just need the value of that row; for the second I want the second selected value plus the first; and for the nth index I want the sum of all values selected so far plus the value of the nth selected row. So I first need a matrix to hold the results:

cumulative_loss<-matrix(rep(NA,8*2),nrow=8,ncol=2)

and then one loop over the columns and another over the selected indices:

for (s in 1:ncol(df)) #for each column
{
  for (i in 1:length(selected_indeces)) #for each randomly selected index
  {
    if (i==1)
    {
      cumulative_loss[i,s]<- df[selected_indeces[i],s]
    }

    if (i > 1)
    {
      cumulative_loss[i,s]<- df[selected_indeces[i],s] +
        df[selected_indeces[i-1],s]
    }
  }
}

The script runs, although it is probably a naive way to do this. The problem is that when i = 4 it only adds the values of the fourth and third selections, whereas I want it to add the first, second, third, and fourth random selections and return that sum.

Upvotes: 0

Views: 95

Answers (3)

MichaelChirico

Reputation: 34703

Here's a way to do this with data.table (taking into account your comment on @bgoldst's answer):

library(data.table); setDT(df)

#sample 4 elements of each column (i.e., every element of .SD), then cumsum them
df[ , lapply(.SD, function(x) cumsum(sample(x, 4)))]

If you want to use different indices for each column, I would pre-choose them first:

set.seed(1023)
idx <- lapply(integer(ncol(df)), function(...) sample(nrow(df), 4))
idx
# [[1]] #indices for column 1
# [1] 2 8 6 3
# 
# [[2]] #indices for column 2
# [1] 4 8 5 1

Then modify the above slightly:

df[ , lapply( seq_along(.SD), function(jj) cumsum(.SD[[jj]][ idx[[jj]] ]) )]

This is the craziest compendium of brackets/parentheses I've ever written in a functional line of code, so I guess it makes sense to break things down a bit:

  • seq_along(.SD) picks out the index number of each column, jj
  • .SD[[jj]] selects the jj-th column, idx[[jj]] selects the indices for that column, and .SD[[jj]][idx[[jj]]] picks the idx[[jj]] rows of the jj-th column; this picks the same values as .SD[idx[[jj]], jj, with = FALSE]
  • Lastly, we cumsum the length(idx[[jj]]) rows we chose for column jj.

Result:

#     V1  V2
# 1:  45  98
# 2:  59 161
# 3:  82 182
# 4: 147 214
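
The same per-column indexing can also be sketched in base R, without data.table. This is only an illustration and not part of the approach above: it assumes df is the data frame from the question, hard-codes the idx values printed earlier instead of re-sampling, and uses Map() to pair each column with its own index vector.

```r
# Data frame from the question
df <- data.frame(scores_set1 = c(32, 45, 65, 96, 45, 23, 23, 14),
                 scores_set2 = c(32, 40, 60, 98, 21, 23, 21, 63))

# Indices copied from the printed idx above (hard-coded, not re-sampled)
idx <- list(c(2, 8, 6, 3), c(4, 8, 5, 1))

# Pair each column with its index vector, subset it, then take the running sum
res <- as.data.frame(Map(function(col, i) cumsum(col[i]), df, idx))
res
#   scores_set1 scores_set2
# 1          45          98
# 2          59         161
# 3          82         182
# 4         147         214
```

Because Map() walks the columns and the index list in parallel, idx must have one entry per column, in column order.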

Upvotes: 2

akrun

Reputation: 886938

With dplyr, if we want to sample each column separately and take the cumsum, we can use mutate_each and then keep the first 4 rows with head:

library(dplyr)
df %>%
   mutate_each(funs(cumsum(sample(.)))) %>%
   head(.,4)

If the same sampled rows should be used for the whole dataset:

df %>%
   slice(sample(row_number(), 4)) %>%
   mutate_each(funs(cumsum))

Upvotes: 0

bgoldst

Reputation: 35314

Conveniently, cumsum() works on data.frames directly, in which case it runs on each column independently. Thus we can index out the selected rows of df with an index operation and pass the result directly to cumsum() to get the required output:

set.seed(0L);
sel <- sample(1:8,4L);
sel;
## [1] 8 2 3 6
df[sel,];
##   scores_set1 scores_set2
## 8          14          63
## 2          45          40
## 3          65          60
## 6          23          23
cumsum(df[sel,]);
##   scores_set1 scores_set2
## 8          14          63
## 2          59         103
## 3         124         163
## 6         147         186
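
For comparison, the loop from the question can be repaired by adding each newly selected value to the previous running total rather than to the previous raw value. The following is a minimal sketch under that assumption, with sel hard-coded to the indices printed above so that it does not depend on the random number generator; it should agree with cumsum(df[sel,]).

```r
# Data frame from the question
df <- data.frame(scores_set1 = c(32, 45, 65, 96, 45, 23, 23, 14),
                 scores_set2 = c(32, 40, 60, 98, 21, 23, 21, 63))

# Indices copied from the output above (hard-coded, not re-sampled)
sel <- c(8, 2, 3, 6)

# One row per selected index, one column per column of df
cumulative_loss <- matrix(NA_real_, nrow = length(sel), ncol = ncol(df))
for (s in seq_len(ncol(df))) {
  for (i in seq_along(sel)) {
    # Running total so far (0 before the first selection)
    previous <- if (i == 1) 0 else cumulative_loss[i - 1, s]
    cumulative_loss[i, s] <- previous + df[sel[i], s]
  }
}
cumulative_loss
```

The only change from the original loop is that the second term reads from cumulative_loss instead of df, which is what carries the earlier selections forward.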

To select different indexes for each column, we can use apply():

set.seed(0L);
apply(df,2L,function(col) cumsum(col[sample(1:8,4L)]));
##      scores_set1 scores_set2
## [1,]          14          63
## [2,]          59         103
## [3,]         124         126
## [4,]         147         147

If you want to compute the indexes in advance, it becomes slightly trickier. Here's one way of doing it:

set.seed(0L);
sels <- replicate(2L,sample(1:8,4L)); sels;
##      [,1] [,2]
## [1,]    8    8
## [2,]    2    2
## [3,]    3    6
## [4,]    6    5
sapply(seq_len(ncol(df)),function(ci) cumsum(df[[ci]][sels[,ci]]));
##      [,1] [,2]
## [1,]   14   63
## [2,]   59  103
## [3,]  124  126
## [4,]  147  147
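
One caveat when re-running these snippets: the printed samples appear to come from an R version older than 3.6.0, where sample() used a different default algorithm, so newer versions will likely draw different indices from the same seed. As a sketch, assuming R 3.6.0 or later, the older sampler can be requested through RNGkind(). The cumulative-sum logic itself does not depend on this.

```r
# Assumes R >= 3.6.0, where RNGkind() accepts the sample.kind argument.
# "Rounding" is the pre-3.6.0 sampler; R may warn that it is non-uniform.
suppressWarnings(RNGkind(sample.kind = "Rounding"))
suppressWarnings(set.seed(0L))
sel <- sample(1:8, 4L)
sel  # intended to match the 8 2 3 6 shown in the first snippet

# Restore the current default sampler afterwards
RNGkind(sample.kind = "Rejection")
```

Switching back to "Rejection" at the end keeps later sampling in the session on the current default.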

Upvotes: 3
