Artemination

Reputation: 723

Why does creating a caret partition speed up the ave function?

I am having trouble working with a data frame that has 2 columns and 4,632,351 rows.

The columns are Nombre (name) and Genero (gender).

I want to count how many times each name occurs and store that count in a new column, so I use the ave function:

data2$NmbCmpDup <- as.numeric(ave(data2$Nombre,data2$Nombre,  FUN = length))

But it takes very long, maybe 3 hours, so I stopped the running process.

Then I created a partition with caret so that I could work with fewer rows:

createDataPartition(data$Genero, p = 0.01, list=F)

So I created a 1% partition and ran the ave function on it:

data.p = createDataPartition(data$Genero, p = 0.01, list=F)
data2 = data[data.p,]
data2$NmbCmpDup <- as.numeric(ave(data2$Nombre,data2$Nombre,  FUN = length))

With that, the ave function ran in about 10 seconds. I then tried 5% and it was still very fast, so I kept increasing the percentage until I had a 100% partition, and the ave function took only 2 minutes.

Now I would like to know why. Any thoughts?

Upvotes: 0

Views: 37

Answers (1)

StupidWolf

Reputation: 46908

ave is slow here, and you do not need it just to get group sizes. You can count the names once with table and then use those counts to populate the column. Below I compare your approach (f1) with two alternatives (f2 and f3) that should be faster. I am also not sure whether your name column, Nombre, is a factor or a character vector.

First, some example data:

set.seed(100)
data2 = data.frame(Nombre = sample(LETTERS, 2e6, replace=TRUE),
                   Genero = sample(c("M","F"), 2e6, replace=TRUE),
                   stringsAsFactors=FALSE)

Here are the functions. I think the data.table version is still not optimal, but it will do for now:

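# f1: the original approach, ave() with length per name
f1 = function(data2){
  data2$NmbCmpDup = as.numeric(ave(data2$Nombre,data2$Nombre,FUN=length))
  data2
}
# f2: count each name once with table(), then look up each row's count by name
f2 = function(data2){
  data2$NmbCmpDup = as.numeric(table(data2$Nombre)[data2$Nombre])
  data2
}
# f3: count per name with data.table (.N by group), then match back to the rows
f3 = function(data2){
  tab = as.data.table(data2)[,.N,by=Nombre]
  data2$NmbCmpDup = tab$N[match(data2$Nombre,tab$Nombre)]
  data2
}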
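To make the lookup step concrete, here is a minimal base R sketch on a toy vector. The vector `nombres` and the variable names are made up for illustration and are not the asker's data. It shows that indexing the table by name, or going through match(), gives the same counts as ave():

```r
# Hypothetical toy vector of names (not the asker's data)
nombres <- c("Ana", "Luis", "Ana", "Eva", "Luis", "Ana")

# Count each distinct name once
tab <- table(nombres)

# Look up each row's count by name (the idea behind f2)
by_name <- as.numeric(tab[nombres])

# Same lookup via match() on the table's names (the lookup step used in f3)
by_match <- as.numeric(tab)[match(nombres, names(tab))]

# Reference result from ave(); on a character vector ave() returns
# character values, hence the as.numeric()
by_ave <- as.numeric(ave(nombres, nombres, FUN = length))

# All three approaches should agree
stopifnot(identical(by_name, by_ave), identical(by_match, by_ave))
print(by_name)
```

The counts are computed once per distinct name, and each row then only needs a lookup.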

Benchmarking the three functions:

library(microbenchmark)
library(data.table)
microbenchmark(f1(data2), f2(data2), f3(data2))
Unit: milliseconds
      expr       min        lq     mean   median       uq      max neval cld
 f1(data2) 584.73459 626.12690 670.0398 643.3440 687.0022 911.2973   100   c
 f2(data2) 175.23440 196.36763 229.3775 213.6137 237.8333 407.0434   100  b 
 f3(data2)  73.35966  94.32614 119.9301 104.9643 119.7894 335.6455   100 a  

So on this example, table is about 3 times faster than ave, and data.table about 6 times faster (comparing median times).
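Since f3 is probably not the best use of data.table, here is a sketch of a common data.table idiom that adds the count column by reference with `:=` and `.N`. It was not part of the benchmark above, so I make no timing claim for it. The toy data and the name `dt` are assumptions for illustration:

```r
# Assumes the data.table package is installed; the block is skipped otherwise
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  # Hypothetical toy data, not the asker's data set
  dt <- data.table(Nombre = c("Ana", "Luis", "Ana", "Eva", "Luis", "Ana"),
                   Genero = c("F", "M", "F", "F", "M", "F"))

  # .N is the number of rows in each group; := adds the column in place,
  # keeping the original row order
  dt[, NmbCmpDup := .N, by = Nombre]
  print(dt)
}
```

Note that `.N` is an integer, so wrap it in as.numeric() if you need a double column as in the original code.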

Upvotes: 1
