Subset based on granularity and average values

Question

I have a large data frame consisting of two columns. I want to calculate the average of the second column's values for each subset of the first column, where the subsets are defined by a specified granularity. For example, for the following data frame, df, I want to calculate the average of the df$B values for each subset of df$A, using an increment (granularity) of 1. The results should be in two new columns.

A       B            expected results     newA      newB
0.22096 1                                  0         1.142857
0.33489 1                                  1         2
0.33655 1                                  2         4
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5

This is a simple example. I'm not sure how to loop over the whole data frame and perform the calculation, i.e. the average of df$B.

I tried the following to create the subsets, but couldn't figure out how to append the results and build the final result:

increment<-1
mx<-max(df$A)
i<-0

newDF<-data.frame()
while(i < mx){
    tmp<-subset(df, (A >i & A< (i+increment)))
    i<-i+increment
}

I'm not sure about the logic, but I'm sure there is a shorter way to do the required calculation. Any thoughts?

sgibb · Accepted Answer

I would use findInterval to assign each row to a subset and tapply to calculate the mean. (In your example, taking the ceiling of each A value would be sufficient too, but if your increment is different from 1 you need findInterval.) For example:

df <- read.table(textConnection("
A       B
0.22096 1
0.33489 1
0.33655 1
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5"), header=TRUE)

## sort data.frame by column A (optional: findInterval only requires the break points to be sorted)
df <- df[order(df$A), ]

## define granularity
subsets <- seq(1, max(ceiling(df$A)), by=1) # change the "by" argument for different increments
df$subset <- findInterval(df$A, subsets)

tapply(df$B, df$subset, mean)
#       0        1        2 
#1.142857 2.000000 4.000000
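
If the result is wanted as the two columns from the question (newA and newB), the same findInterval/tapply idea can be wrapped up roughly as follows. This is only a sketch: it assumes all A values are non-negative so that the break points can start at 0, and the names breaks, bin and result are made up for illustration.

```r
# Example data from the question.
df <- data.frame(
  A = c(0.22096, 0.33489, 0.33655, 0.43953, 0.64933, 0.86668, 0.96932,
        1.09342, 1.58314, 1.88481, 2.07654, 2.34652, 2.79777),
  B = c(1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 4, 3, 5)
)

increment <- 1  # bin width; assumed to be positive

# Break points start at 0 (assumption: all A >= 0) and extend past max(A).
breaks <- seq(0, max(df$A) + increment, by = increment)

# For each A, findInterval() gives the index i of the left-closed bin
# with breaks[i] <= A < breaks[i + 1].
bin <- findInterval(df$A, breaks)

# Mean of B per occupied bin, collected into the two requested columns.
means <- tapply(df$B, bin, mean)
result <- data.frame(
  newA = breaks[as.integer(names(means))],  # lower edge of each bin
  newB = as.vector(means)
)
print(result)
```

For the example data this should reproduce the expected values from the question (newA 0, 1, 2 with newB about 1.142857, 2, 4). Because the break points start at 0 rather than 1, changing increment to something like 0.5 also splits the values below 1 into their own bins. Bins that contain no rows simply do not appear in result.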
