Noosentin
Noosentin

Reputation: 71

Average columns based on other column value and number of rows in R

I'm using R and am trying to create a new dataframe of averaged results from another dataframe based on the values in Column A. To demonstrate my goal here is some data:

set.seed(1981)
df <- data.frame(A = sample(c(0,1), replace=TRUE, size=100),
B=round(runif(100), digits=4), 
C=sample(1:1000, 100, replace=TRUE))
head(df, 30)

A   B       C
0   0.6739  459
1   0.5466  178
0   0.154   193
0   0.41    206
1   0.7526  791
1   0.3104  679
1   0.739   434
1   0.421   171
0   0.3653  577
1   0.4035  739
0   0.8796  147
0   0.9138  37
0   0.7257  350
1   0.2125  779
0   0.1502  495
1   0.2972  504
0   0.2406  245
1   0.0325  613
0   0.8642  539
1   0.1096  630
1   0.2113  363
1   0.277   974
0   0.0485  755
1   0.0553  412
0   0.509   24
0   0.2934  795
0   0.0725  413
0   0.8723  606
0   0.3192  591
1   0.5557  177

I need to reduce the size of the data by calculating the average value for column B and column C for as many rows as the value in Column A stays consecutively the same, up to a maximum of 3 rows. If value A remains either 1, or 0 for more than 3 rows it would roll over into the next row in the new dataframe as you can see below.

The new dataframe requires the following columns:

Value of A  B.Av         C.Av   No. of rows used
0           0.6739       459            1
1           0.5466       178            1
0           0.282        199.5          2
1           0.600666667  634.6666667    3
1           0.421        171            1
0           0.3653       577            1
1           0.4035       739            1
0           0.8397       178            3
1           0.2125       779            1
0           0.1502       495            1
1           0.2972       504            1
0           0.2406       245            1
1           0.0325       613            1
0           0.8642       539            1
1           0.1993       655.6666667    3
0           0.0485       755            1
1           0.0553       412            1
0           0.291633333  410.6666667    3
0           0.59575      598.5          2
1           0.5557       177            1

I haven't managed to find another similar scenario to mine whilst searching Stack Overflow so any help would be really appreciated.

Upvotes: 1

Views: 1261

Answers (2)

aichao
aichao

Reputation: 7435

Here is a base-R solution:

## define a function to split the run-length if greater than 3
split.3 <- function(l,v) {
  o <- c(values=v,lengths=min(l,3)) 
  while(l > 3) {
    l <- l - 3
    o <- rbind(o,c(values=v,lengths=min(l,3)))
  } 
  return(o)
}

## compute the run-length encoding of column A
rl <- rle(df$A)

## apply split.3 to the run-length encoding
## the first column of vl are the values of column A
## the second column of vl are the corresponding run-length limited to 3
vl <- do.call(rbind,mapply(split.3,rl$lengths,rl$values))

## compute the begin and end row indices of df for each value of A to average
fin <- cumsum(vl[,2])
beg <- fin - vl[,2] + 1

## compute the averages
out <- do.call(rbind,lapply(1:length(beg), function(i) data.frame(`Value of A`=vl[i,1],
                                                                  B.Av=mean(df$B[beg[i]:fin[i]]),
                                                                  C.Av=mean(df$C[beg[i]:fin[i]]),
                                                                  `No. of rows used`=fin[i]-beg[i]+1)))
##   Value.of.A      B.Av     C.Av No..of.rows.used
##1           0 0.6739000 459.0000                1
##2           1 0.5466000 178.0000                1
##3           0 0.2820000 199.5000                2
##4           1 0.6006667 634.6667                3
##5           1 0.4210000 171.0000                1
##6           0 0.3653000 577.0000                1
##7           1 0.4035000 739.0000                1
##8           0 0.8397000 178.0000                3
##9           1 0.2125000 779.0000                1
##10          0 0.1502000 495.0000                1
##11          1 0.2972000 504.0000                1
##12          0 0.2406000 245.0000                1
##13          1 0.0325000 613.0000                1
##14          0 0.8642000 539.0000                1
##15          1 0.1993000 655.6667                3
##16          0 0.0485000 755.0000                1
##17          1 0.0553000 412.0000                1
##18          0 0.2916333 410.6667                3
##19          0 0.5957500 598.5000                2
##20          1 0.5557000 177.0000                1

Upvotes: 2

akuiper
akuiper

Reputation: 214957

Here is a data.table solution:

library(data.table)
setDT(df)
# create two group variables, consecutive A and for each consecutive A every three rows
(df[,rleid := rleid(A)][, threeWindow := ((1:.N) - 1) %/% 3, rleid]

# calculate the average of the columns grouped by the above two variables
   [, c(.N, lapply(.SD, mean)), .(rleid, threeWindow)]

# drop group variables
   [, `:=`(rleid = NULL, threeWindow = NULL)][])

 #   N A         B        C
 #1: 1 0 0.6739000 459.0000
 #2: 1 1 0.5466000 178.0000
 #3: 2 0 0.2820000 199.5000
 #4: 3 1 0.6006667 634.6667
 #5: 1 1 0.4210000 171.0000
 #6: 1 0 0.3653000 577.0000
 #7: 1 1 0.4035000 739.0000
 #8: 3 0 0.8397000 178.0000
 #9: 1 1 0.2125000 779.0000
#10: 1 0 0.1502000 495.0000
#11: 1 1 0.2972000 504.0000
#12: 1 0 0.2406000 245.0000
#13: 1 1 0.0325000 613.0000
#14: 1 0 0.8642000 539.0000
#15: 3 1 0.1993000 655.6667
#16: 1 0 0.0485000 755.0000
#17: 1 1 0.0553000 412.0000
#18: 3 0 0.2916333 410.6667
#19: 2 0 0.5957500 598.5000
#20: 1 1 0.5557000 177.0000

Upvotes: 1

Related Questions