I'm using the UBL::SmoteClassif() function in R to over-sample minority classes and create a more balanced dataset. I have 8 classes. The function works on a dataset with 357,038 rows and 147 columns (covariates), but on another dataset with 186,274 rows and 186 columns it produces the following error:
"Error in neighbours(tgt, dat, dist, p, k) : long vectors (argument 10) are not supported in .Fortran"
Is there a formula that takes the number of columns in the dataset and the function's other parameter settings and returns the maximum number of rows for which the function will work? This would help me scale my analysis.
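One fact that bounds any such formula: `.C` and `.Fortran` reject any argument whose length exceeds `.Machine$integer.max` (2^31 - 1 elements), which is what "long vectors ... are not supported" refers to. The sketch below turns that into a row limit under an assumption I have not verified against the UBL source: that argument 10 is an n x n double buffer, where n is the number of rows of the single class currently being over-sampled. The helper names `max_rows_square` and `classes_over_limit` are hypothetical, and the class sizes and `C.perc` values are copied from the reproducible example in this post.

```r
# Hard limit on the length of any vector passed through .C/.Fortran.
max_len <- .Machine$integer.max  # 2^31 - 1 = 2147483647

# ASSUMPTION (not checked against the UBL source): the rejected argument is
# an n x n buffer, so the largest usable n satisfies n^2 <= max_len.
max_rows_square <- function(limit = .Machine$integer.max) {
  floor(sqrt(limit))
}

# Hypothetical pre-flight check: names of classes that would be over-sampled
# (C.perc > 1) and whose row count exceeds the assumed limit.
classes_over_limit <- function(class_sizes, c_perc, limit = max_rows_square()) {
  oversampled <- names(c_perc)[unlist(c_perc) > 1]
  sizes <- class_sizes[oversampled]
  names(sizes)[sizes > limit]
}

# Class sizes and C.perc from the failing example (186,274 rows).
failing_sizes <- c(Class_1 = 15735, Class_2 = 3767, Class_3 = 9874,
                   Class_4 = 30670, Class_5 = 1540, Class_6 = 25109,
                   Class_7 = 84307, Class_8 = 15272)
failing_perc <- list(Class_1 = 1.15, Class_2 = 4.19, Class_3 = 1.55,
                     Class_4 = 1.00, Class_5 = 9.81, Class_6 = 1.04,
                     Class_7 = 1.01, Class_8 = 1.00)

# Class sizes and C.perc from the working example (357,038 rows).
working_sizes <- c(Class_1 = 31878, Class_2 = 6406, Class_3 = 31351,
                   Class_4 = 55430, Class_5 = 1598, Class_6 = 32293,
                   Class_7 = 176013, Class_8 = 22069)
working_perc <- list(Class_1 = 1.14, Class_2 = 4.84, Class_3 = 1.00,
                     Class_4 = 1.00, Class_5 = 18.2, Class_6 = 1.57,
                     Class_7 = 1.00, Class_8 = 1.33)

max_rows_square()                                  # 46340
classes_over_limit(failing_sizes, failing_perc)    # "Class_7"
classes_over_limit(working_sizes, working_perc)    # character(0)
```

If the assumption holds, the number of columns would not enter the limit at all; what matters is the size of the largest class with `C.perc > 1`. That would be consistent with both examples: the failing one over-samples Class_7 (84,307 rows, above 46,340), while the working one leaves its largest class at 1.00. Memory is a separate constraint, since an n x n double buffer at n = 46,340 would already need roughly 17 GB.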
Here is a reproducible example that is similar to what I was doing:
library(UBL)
library(tidyverse)
# Case 1: 186,274 rows x 186 numeric columns, plus an 8-level class label
test <- data.frame(replicate(186, sample(0:10000, 186274, replace = TRUE)),
                   class = c(rep("Class_1", 15735),
                             rep("Class_2", 3767),
                             rep("Class_3", 9874),
                             rep("Class_4", 30670),
                             rep("Class_5", 1540),
                             rep("Class_6", 25109),
                             rep("Class_7", 84307),
                             rep("Class_8", 15272)))
test <- test %>% mutate(class = factor(class, levels = c('Class_1','Class_2','Class_3','Class_4','Class_5','Class_6','Class_7','Class_8')))
# Over-sampling percentage per class (1.00 leaves the class unchanged)
l <- list(Class_1 = 1.15, Class_2 = 4.19, Class_3 = 1.55, Class_4 = 1.00,
          Class_5 = 9.81, Class_6 = 1.04, Class_7 = 1.01, Class_8 = 1.00)
datBal <- SmoteClassif(class ~ ., test, C.perc = l)  # fails with the long-vectors error
# Case 2: 357,038 rows x 186 numeric columns, plus an 8-level class label
test <- data.frame(replicate(186, sample(0:10000, 357038, replace = TRUE)),
                   class = c(rep("Class_1", 31878),
                             rep("Class_2", 6406),
                             rep("Class_3", 31351),
                             rep("Class_4", 55430),
                             rep("Class_5", 1598),
                             rep("Class_6", 32293),
                             rep("Class_7", 176013),
                             rep("Class_8", 22069)))
test <- test %>% mutate(class = factor(class, levels = c('Class_1','Class_2','Class_3','Class_4','Class_5','Class_6','Class_7','Class_8')))
# Over-sampling percentage per class (1.00 leaves the class unchanged)
l <- list(Class_1 = 1.14, Class_2 = 4.84, Class_3 = 1.00, Class_4 = 1.00,
          Class_5 = 18.2, Class_6 = 1.57, Class_7 = 1.00, Class_8 = 1.33)
datBal <- SmoteClassif(class ~ ., test, C.perc = l)  # this works
Link to SmoteClassif source code
Link to foreign function interface