I'm trying to find the most efficient way to loop through a data frame and combine consecutive groups of observations into clusters of at least 5 people. For example, if I have:
group <- c(1,2,3,4,5,6,7,8,9,10)
people <- c(1,2,3,4,4,3,2,1,2,3)
age <- c(5,10,15,20,25,30,35,40,45,50)
data <- data.frame(group,people,age)
This should generate
   group people age
1      1      1   5
2      2      2  10
3      3      3  15
4      4      4  20
5      5      4  25
6      6      3  30
7      7      2  35
8      8      1  40
9      9      2  45
10    10      3  50
I'd then like to combine groups into "clusters" of at least 5 people each, with a weighted average age for each cluster. I'd like to do this efficiently by going through the data set and sequentially adding groups until a cluster has at least 5 people. Our data should then look like:
   group people age cluster tot_ppl avg_age
1      1      1   5       1       6   11.67
2      2      2  10       1       6   11.67
3      3      3  15       1       6   11.67
4      4      4  20       2       8   22.50
5      5      4  25       2       8   22.50
6      6      3  30       3       5   32.00
7      7      2  35       3       5   32.00
8      8      1  40       4       6   46.67
9      9      2  45       4       6   46.67
10    10      3  50       4       6   46.67
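For example, cluster 1 combines groups 1-3, so tot_ppl = 1 + 2 + 3 = 6 and its weighted average age is (1*5 + 2*10 + 3*15)/6 = 70/6 ≈ 11.67.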
I'd like to do something like this on a dataset with roughly 10,000 observations instead of 10. Does anyone have an idea of an efficient way of going about this?
Here's what I've got so far. However, some of the data samples I'm working with actually have closer to 2 million observations, so it can take quite a while to run...
data$cluster <- 0
count <- 0
# keep bumping rows into the next cluster until every row is assigned
# alternative stopping rule: while (max(data$cluster) <= 10)
while (min(data$cluster) == 0) {
  count <- count + 1
  # running total of people within each (zipcode, cluster)
  data$cum <- ave(data$people, by = list(data$zipcode, data$cluster), FUN = cumsum)
  data$a <- floor(data$cum / 10)
  data$b <- data$cum - data$people  # cumulative total before this row
  data$c <- floor(data$b / 10)
  # rows whose preceding total is still under 10 move to the next cluster
  data$cluster[data$c == 0] <- data$cluster[data$c == 0] + 1
}
# drop the helper columns
extravars <- c('cum', 'a', 'b', 'c')
data[extravars] <- NULL
# total people in each (zipcode, cluster)
data$tot_ppl <- ave(data$people, by = list(data$zipcode, data$cluster), FUN = sum)
# merge any cluster that came up short of 10 people into the next one
data$cluster[data$tot_ppl < 10] <- data$cluster[data$tot_ppl < 10] + 1
data$tot_ppl <- ave(data$people, by = list(data$zipcode, data$cluster), FUN = sum)
data2 <- data
# replace each variable with its people-weighted cluster average
for (i in 3:(ncol(data2) - 3)) {
  data2$x <- data2[, i] * data2$people
  data2$x <- ave(data2$x, by = list(data2$zipcode, data2$cluster), FUN = sum)
  data2$x <- round(data2$x / data2$tot_ppl, digits = 2)
  data2[, i] <- data2$x
}
data2$x <- NULL
So while this works, it takes a few hours to run. If anybody knows a way to make this more efficient or improve it, I'd greatly appreciate it. Thanks!
I can't really think of a clever way to vectorize this operation (each cluster boundary depends on the running total of all earlier rows), so you could just use a for loop in R:
pureR <- function(x, lim) {
  cs <- cumsum(x)
  newGroup <- rep(FALSE, length(x))
  prevSum <- 0
  # flag each row at which the running total since the last break reaches lim
  for (i in seq_along(newGroup)) {
    if (cs[i] - prevSum >= lim) {
      newGroup[i] <- TRUE
      prevSum <- cs[i]
    }
  }
  # cluster id = 1 + number of clusters completed before this row
  return(1 + c(0, head(cumsum(newGroup), -1)))
}
pureR(dat$people, 5)
# [1] 1 1 1 2 2 3 3 4 4 4
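From there, you could rebuild the tot_ppl and avg_age columns from the question with ave(); this is just a sketch, assuming dat is the question's example data frame with columns group, people, and age:
dat$cluster <- pureR(dat$people, 5)
# total people per cluster
dat$tot_ppl <- ave(dat$people, dat$cluster, FUN = sum)
# people-weighted average age per cluster, rounded to 2 digits
dat$avg_age <- round(ave(dat$people * dat$age, dat$cluster, FUN = sum) / dat$tot_ppl, 2)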
You can use the Rcpp package to speed up non-vectorized computations:
library(Rcpp)
rcpp <- cppFunction("
NumericVector rcpp(NumericVector x, const double limit) {
  NumericVector result(x.size());
  result[0] = 1;
  double acc = x[0];  // people accumulated in the current cluster so far
  for (int i = 1; i < x.size(); ++i) {
    if (acc >= limit) {
      // the current cluster is full, so this row starts a new one
      result[i] = result[i-1] + 1;
      acc = x[i];
    } else {
      result[i] = result[i-1];
      acc += x[i];
    }
  }
  return result;
}
")
rcpp(dat$people, 5)
# [1] 1 1 1 2 2 3 3 4 4 4
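Your own code clusters within zipcode; assuming your real data has such a column (it isn't in the example data), one way to apply either function separately per zipcode is with ave(), here using the 10-person threshold from your code:
# assumes data has a zipcode column, as in the question's code;
# ave() runs the clustering separately within each zipcode,
# so cluster numbers restart at 1 in every zipcode
data$cluster <- ave(data$people, data$zipcode, FUN = function(x) rcpp(x, 10))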
Finally, we can benchmark on a dataset with 10,000 observations:
set.seed(144)
dat2 <- dat[sample(1:nrow(dat), 10000, replace=TRUE),]
library(microbenchmark)
microbenchmark(pureR(dat2$people, 5), rcpp(dat2$people, 5))
# Unit: microseconds
# expr min lq mean median uq max neval
# pureR(dat2$people, 5) 7073.571 7287.733 8665.394 7822.639 8749.232 31313.946 100
# rcpp(dat2$people, 5) 90.309 98.241 129.120 118.351 136.210 324.866 100
While the Rcpp code is more than 60x faster, the pure R implementation still runs in under 10 milliseconds on a dataset of 10,000 rows, which is probably fine for you.