mjdub

Reputation: 91

Most efficient way to loop through each observation in a data frame

I'm trying to find the most efficient way to loop through a data frame and cluster observations into groups of at least 5 people. For example, if I have:

group <- c(1,2,3,4,5,6,7,8,9,10)
people <- c(1,2,3,4,4,3,2,1,2,3)
avg_age <- c(5,10,15,20,25,30,35,40,45,50)
data <- data.frame(group, people, avg_age)

This should generate

   group people avg_age
1      1      1   5
2      2      2  10
3      3      3  15
4      4      4  20
5      5      4  25
6      6      3  30
7      7      2  35
8      8      1  40
9      9      2  45
10    10      3  50

I'd then like to combine groups into "clusters" of at least 5 people, each with a people-weighted average age. I'd like to do this efficiently, walking through the data set and sequentially adding groups until a cluster has at least 5 people. The data should then look like:

   group people avg_age cluster tot_ppl wtd_avg_age
1      1      1       5       1       6       11.67
2      2      2      10       1       6       11.67
3      3      3      15       1       6       11.67
4      4      4      20       2       8       22.50
5      5      4      25       2       8       22.50
6      6      3      30       3       5       32.00
7      7      2      35       3       5       32.00
8      8      1      40       4       6       46.67
9      9      2      45       4       6       46.67
10    10      3      50       4       6       46.67
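
To make the arithmetic concrete, cluster 1 pools groups 1-3 (1 + 2 + 3 = 6 people), and its weighted average age is:

# people-weighted average age for cluster 1
sum(c(1, 2, 3) * c(5, 10, 15)) / sum(c(1, 2, 3))
# [1] 11.66667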

I'd like to do something like this on a dataset with roughly 10,000 observations instead of 10. Does anyone have an idea of an efficient way of going about this?


Here's what I've got so far. However, some of the data samples I'm working with actually have closer to 2 million observations, so it can take quite a while to run...

data$cluster <- 0
count <- 0

while (min(data$cluster) == 0) {
  count <- count + 1
  # running total of people within each (zipcode, cluster) block
  data$cum <- ave(data$people, data$zipcode, data$cluster, FUN = cumsum)
  data$a <- floor(data$cum / 10)
  # running total of people *before* the current row
  data$b <- data$cum - data$people
  data$c <- floor(data$b / 10)
  # rows whose preceding running total is still under the threshold of 10
  # move up one cluster
  data$cluster[data$c == 0] <- data$cluster[data$c == 0] + 1
}

# drop the temporary columns
data[c("cum", "a", "b", "c")] <- NULL

data$tot_ppl <- ave(data$people, data$zipcode, data$cluster, FUN = sum)
# merge any undersized cluster into the next one, then recompute totals
data$cluster[data$tot_ppl < 10] <- data$cluster[data$tot_ppl < 10] + 1
data$tot_ppl <- ave(data$people, data$zipcode, data$cluster, FUN = sum)

data2 <- data


for (i in 3:(ncol(data2) - 3)) {
  # people-weighted average of column i within each (zipcode, cluster)
  data2$x <- data2[, i] * data2$people
  data2$x <- ave(data2$x, data2$zipcode, data2$cluster, FUN = sum)
  data2$x <- round(data2$x / data2$tot_ppl, digits = 2)
  data2[, i] <- data2$x
}

data2$x <- NULL

So while this works, it takes a few hours to run, so if anybody knows a way to make this more efficient or improve it, I'd greatly appreciate it. Thanks!

Upvotes: 0

Views: 789

Answers (1)

josliber

Reputation: 44340

I can't really think of a clever way to vectorize this operation, so you could just use a for loop in R:

pureR <- function(x, lim) {
    cs <- cumsum(x)
    newGroup <- rep(FALSE, length(x))
    prevSum <- 0
    for (i in seq_along(newGroup)) {
        # start a new cluster once the running total since the last
        # cluster boundary reaches lim
        if (cs[i] - prevSum >= lim) {
            newGroup[i] <- TRUE
            prevSum <- cs[i]
        }
    }
    # newGroup marks the *last* row of each cluster; shift by one row so
    # the increment takes effect on the following row
    return(1 + c(0, head(cumsum(newGroup), -1)))
}
dat <- data.frame(group, people, avg_age)   # the question's example data
pureR(dat$people, 5)
# [1] 1 1 1 2 2 3 3 4 4 4
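
With the cluster assignments in hand, the rest of your desired output follows with ave. A minimal sketch on the example data (I'm calling the cluster-level average wtd_avg_age, matching the desired output above):

dat$cluster <- pureR(dat$people, 5)
# total people in each cluster
dat$tot_ppl <- ave(dat$people, dat$cluster, FUN = sum)
# people-weighted average age within each cluster
dat$wtd_avg_age <- round(
    ave(dat$people * dat$avg_age, dat$cluster, FUN = sum) / dat$tot_ppl, 2)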

You can use the Rcpp package to speed up non-vectorized computations:

library(Rcpp)
rcpp <- cppFunction("
NumericVector rcpp(NumericVector x, const double limit) {
    NumericVector result(x.size());
    result[0] = 1;
    double acc = x[0];
    for (int i=1; i < x.size(); ++i) {
        if (acc >= limit) {
            result[i] = result[i-1] + 1;
            acc = x[i];
        } else {
            result[i] = result[i-1];
            acc += x[i];
        }
    }
    return result;
}
")
rcpp(dat$people, 5)
# [1] 1 1 1 2 2 3 3 4 4 4
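
If, as in your own code, the clusters need to restart within each zipcode, either function can be applied group-wise with ave. A sketch under the assumption that your real data frame has a zipcode column and uses your threshold of 10:

# cluster ids restart at 1 within each zipcode
data$cluster <- ave(data$people, data$zipcode,
                    FUN = function(p) rcpp(p, 10))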

Finally, we can benchmark on a dataset with 10,000 observations:

set.seed(144)
dat2 <- dat[sample(1:nrow(dat), 10000, replace=TRUE),]
library(microbenchmark)
microbenchmark(pureR(dat2$people, 5), rcpp(dat2$people, 5))
# Unit: microseconds
#                   expr      min       lq     mean   median       uq       max neval
#  pureR(dat2$people, 5) 7073.571 7287.733 8665.394 7822.639 8749.232 31313.946   100
#   rcpp(dat2$people, 5)   90.309   98.241  129.120  118.351  136.210   324.866   100

While the Rcpp code is more than 60x faster than the pure R implementation, the pure R version still runs in under 10 milliseconds on a dataset of 10,000 rows, which is probably fine for your purposes.
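
For the 2 million observations you mention, the same benchmark scales directly; a sketch (actual timings will depend on your machine):

# resample the example data up to roughly the real data's size
dat3 <- dat[sample(1:nrow(dat), 2e6, replace = TRUE), ]
microbenchmark(pureR(dat3$people, 5), rcpp(dat3$people, 5), times = 10)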

Upvotes: 1
