I'm trying to find the most efficient way to loop through a data frame and combine consecutive groups of observations into clusters of at least 5 people. For example, if I have:
group <- c(1,2,3,4,5,6,7,8,9,10)
people <- c(1,2,3,4,4,3,2,1,2,3)
age <- c(5,10,15,20,25,30,35,40,45,50)
data <- data.frame(group,people,age)
This should generate
   group people age
1      1      1   5
2      2      2  10
3      3      3  15
4      4      4  20
5      5      4  25
6      6      3  30
7      7      2  35
8      8      1  40
9      9      2  45
10    10      3  50
I'd then like to combine groups into "clusters" of at least 5 people each, with a weighted average age for each cluster. I'd like to do this efficiently by going through the data set and sequentially adding groups until a cluster has at least 5 people. Our data should then look like:
   group people age cluster tot_ppl avg_age
1      1      1   5       1       6   11.67
2      2      2  10       1       6   11.67
3      3      3  15       1       6   11.67
4      4      4  20       2       8   22.50
5      5      4  25       2       8   22.50
6      6      3  30       3       5   32.00
7      7      2  35       3       5   32.00
8      8      1  40       4       6   46.67
9      9      2  45       4       6   46.67
10    10      3  50       4       6   46.67
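For example, cluster 1 combines groups 1-3, so tot_ppl = 1 + 2 + 3 = 6 and its weighted average age is (1*5 + 2*10 + 3*15)/6 = 70/6 ≈ 11.67.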
I'd like to do something like this on a dataset with roughly 10,000 observations instead of 10. Does anyone have an idea of an efficient way of going about this?
Here's what I've got so far. However, some of the data samples I'm working with actually have closer to 2 million observations, so it can take quite a while to run...
data$cluster <- 0
count <- 0
# keep bumping rows into the next cluster until every row is assigned
# alternative stopping rule: while (max(data$cluster) <= 10)
while (min(data$cluster) == 0) {
  count <- count + 1
  # running total of people within each (zipcode, cluster)
  data$cum <- ave(data$people, by = list(data$zipcode, data$cluster), FUN = cumsum)
  data$a <- floor(data$cum / 10)
  data$b <- data$cum - data$people  # cumulative total before this row
  data$c <- floor(data$b / 10)
  # rows whose preceding total is still under 10 move to the next cluster
  data$cluster[data$c == 0] <- data$cluster[data$c == 0] + 1
}
# drop the helper columns
extravars <- c('cum', 'a', 'b', 'c')
data[extravars] <- NULL
# total people in each (zipcode, cluster)
data$tot_ppl <- ave(data$people, by = list(data$zipcode, data$cluster), FUN = sum)
# merge any cluster that came up short of 10 people into the next one
data$cluster[data$tot_ppl < 10] <- data$cluster[data$tot_ppl < 10] + 1
data$tot_ppl <- ave(data$people, by = list(data$zipcode, data$cluster), FUN = sum)
data2 <- data
# replace each variable with its people-weighted cluster average
for (i in 3:(ncol(data2) - 3)) {
  data2$x <- data2[, i] * data2$people
  data2$x <- ave(data2$x, by = list(data2$zipcode, data2$cluster), FUN = sum)
  data2$x <- round(data2$x / data2$tot_ppl, digits = 2)
  data2[, i] <- data2$x
}
data2$x <- NULL
So while this works, it takes a few hours to run. If anybody knows a way to make this more efficient or improve it, I'd greatly appreciate it. Thanks!
I can't really think of a clever way to vectorize this operation (each cluster boundary depends on the running total of all earlier rows), so you could just use a for loop in R:
pureR <- function(x, lim) {
  cs <- cumsum(x)
  newGroup <- rep(FALSE, length(x))
  prevSum <- 0
  # flag each row at which the running total since the last break reaches lim
  for (i in seq_along(newGroup)) {
    if (cs[i] - prevSum >= lim) {
      newGroup[i] <- TRUE
      prevSum <- cs[i]
    }
  }
  # cluster id = 1 + number of clusters completed before this row
  return(1 + c(0, head(cumsum(newGroup), -1)))
}
pureR(dat$people, 5)
# [1] 1 1 1 2 2 3 3 4 4 4
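From there, you could rebuild the tot_ppl and avg_age columns from the question with ave(); this is just a sketch, assuming dat is the question's example data frame with columns group, people, and age:
dat$cluster <- pureR(dat$people, 5)
# total people per cluster
dat$tot_ppl <- ave(dat$people, dat$cluster, FUN = sum)
# people-weighted average age per cluster, rounded to 2 digits
dat$avg_age <- round(ave(dat$people * dat$age, dat$cluster, FUN = sum) / dat$tot_ppl, 2)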
You can use the Rcpp package to speed up non-vectorized computations:
library(Rcpp)
rcpp <- cppFunction("
NumericVector rcpp(NumericVector x, const double limit) {
  NumericVector result(x.size());
  result[0] = 1;
  double acc = x[0];  // people accumulated in the current cluster so far
  for (int i = 1; i < x.size(); ++i) {
    if (acc >= limit) {
      // the current cluster is full, so this row starts a new one
      result[i] = result[i-1] + 1;
      acc = x[i];
    } else {
      result[i] = result[i-1];
      acc += x[i];
    }
  }
  return result;
}
")
rcpp(dat$people, 5)
# [1] 1 1 1 2 2 3 3 4 4 4
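Your own code clusters within zipcode; assuming your real data has such a column (it isn't in the example data), one way to apply either function separately per zipcode is with ave(), here using the 10-person threshold from your code:
# assumes data has a zipcode column, as in the question's code;
# ave() runs the clustering separately within each zipcode,
# so cluster numbers restart at 1 in every zipcode
data$cluster <- ave(data$people, data$zipcode, FUN = function(x) rcpp(x, 10))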
Finally, we can benchmark on a dataset with 10,000 observations:
set.seed(144)
dat2 <- dat[sample(1:nrow(dat), 10000, replace=TRUE),]
library(microbenchmark)
microbenchmark(pureR(dat2$people, 5), rcpp(dat2$people, 5))
# Unit: microseconds
# expr min lq mean median uq max neval
# pureR(dat2$people, 5) 7073.571 7287.733 8665.394 7822.639 8749.232 31313.946 100
# rcpp(dat2$people, 5) 90.309 98.241 129.120 118.351 136.210 324.866 100
While the Rcpp code is more than 60x faster, the pure R implementation still runs in under 10 milliseconds on a dataset of 10,000 rows, which is probably fine for you.