Reputation: 3
I am new to R and Stack Overflow and hope you can help me with a question. I wrote the following piece of code:
my = matrix(c(1,1,1,1,1,1,1,1,1,1,
              2,2,2,2,2,2,2,
              0,1,2,3,5,6,7,10,11,14,
              0,1,2,3,4,6,10), ncol = 2, nrow = 17)
colnames(my) = c("ID", "AGE")
my = as.data.frame(my)
my$new = my$ID
system.time(for (i in 1:length(my$ID)) {
  ifelse(my$ID[i] == my$ID[i-1],
         ifelse(my$AGE[i] - my$AGE[i-1] == 1,
                my$new[i] <- my$new[i-1],
                my$new[i] <- my$new[i-1] + 0.1),
         my$new[i] <- my$ID[i])
})
It compares each row's ID and AGE with the previous row's: within the same ID, if AGE is not equal to the previous AGE + 1, it adds 0.1 to the previous value of 'new'; the result is stored in the column 'new'. Here is the output:
   ID AGE new
1   1   0 1.0
2   1   1 1.0
3   1   2 1.0
4   1   3 1.0
5   1   5 1.1
6   1   6 1.1
7   1   7 1.1
8   1  10 1.2
9   1  11 1.2
10  1  14 1.3
11  2   0 2.0
12  2   1 2.0
13  2   2 2.0
14  2   3 2.0
15  2   4 2.0
16  2   6 2.1
17  2  10 2.2
The problem is that it is really quick for datasets of, say, 1,000 rows, but on my actual dataset, which has more than 8.5 million rows, it seems like it will never finish; I waited a few hours with no success.
I would greatly appreciate any suggestions for improving its speed/efficiency.
Upvotes: 0
Views: 76
Reputation: 166
When dealing with large datasets, you should try to vectorize your calculations to improve speed.
Here is an example using the dplyr package (chosen for its lag function, which gives the previous value of AGE):
require(dplyr)          # for group_by, mutate and %>%
require(microbenchmark) # to compare codes
# Your data
my <- matrix(c(1,1,1,1,1,1,1,1,1,1,
               2,2,2,2,2,2,2,
               0,1,2,3,5,6,7,10,11,14,
               0,1,2,3,4,6,10), ncol = 2, nrow = 17)
colnames(my) <- c("ID", "AGE")
my <- as.data.frame(my)
my$new <- my$ID
my2 <- my[rep(1:nrow(my), times = 100), ] # larger dataset (1,700 rows)
# Your function
f1 <- function(my) {
  for (i in 1:length(my$ID)) {
    ifelse(my$ID[i] == my$ID[i-1],
           ifelse(my$AGE[i] - my$AGE[i-1] == 1,
                  my$new[i] <- my$new[i-1],
                  my$new[i] <- my$new[i-1] + 0.1),
           my$new[i] <- my$ID[i])
  }
  my # return the modified data frame so the results can be compared
}
# dplyr function
f2 <- function(my) {
  my %>%
    group_by(ID) %>%                           # work by ID
    mutate(new2 = ifelse(is.na(lag(AGE)),      # lag(AGE) is NA on the first record of an ID,
                         FALSE,                # so new2 is FALSE there
                         AGE != lag(AGE) + 1), # new2 is TRUE if AGE != previous AGE + 1
           new3 = new + cumsum(new2)/10)       # add 0.1 for each TRUE seen so far within the ID
}
With the small dataset the for loop is faster, but with the large dataset the improvement is clear:
# Compare codes
microbenchmark(f1(my), f2(my),
               f1(my2), f2(my2))
Unit: milliseconds
    expr        min         lq       mean     median         uq       max neval cld
  f1(my)   1.470699   1.957855   2.750798   2.049954   2.243380  62.34942   100   a
  f2(my)   2.741614   3.853356   4.235745   4.147085   4.421738  10.84871   100   a
 f1(my2) 156.986927 215.515605 218.806729 222.390968 228.362988 290.19161   100   b
 f2(my2)   3.398812   4.377638   5.128953   4.659674   5.161190  28.97461   100   a
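To check that the two approaches agree on the example data (a quick sketch; all.equal is used rather than identical because the repeated +0.1 in the loop accumulates floating-point error):
# Should be TRUE: the loop's 'new' matches dplyr's 'new3'
all.equal(f1(my)$new, f2(my)$new3)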
Upvotes: 1