berkorbay

Reputation: 465

How to get rpart to work on a relatively large number of rows (~100k)?

I need to do clustering on a simple but fairly large data set. It has 3 columns and about 120k rows, and all the data is numeric. I tried to use rpart but got this lovely error:

Error in rep(1, numclass^2) : invalid 'times' argument
In addition: Warning message:
In matrix(rep(1, numclass^2) - diag(numclass), numclass) :
  NAs introduced by coercion

The function call is nothing fancy either.

fit<-rpart(respVar ~ Var1 + Var2, data=varData, method="class")

I have no problem with 1k rows. It is somewhat slow with 10k rows, but it still works. There are no NA values in the data set. I am currently trying this on a MacBook Air, but will also try it on a Mac Mini.

I suspect it is a memory issue, but the warning message concerns me. Is there a workaround to get the clustering to work?

Upvotes: 0

Views: 1160

Answers (2)

teddy_pear

Reputation: 9

I ran into the same problem, but after searching around, I haven't found any solutions.

One way I worked around it was to change method="class" to method="anova" (switching from classification to regression), and it worked for me.

How many levels are there in your response variable? If it has quite a lot of levels, you could try method="anova".
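To make that suggestion concrete, here is a minimal sketch of how you could count the distinct response values and then fit a regression tree instead. The data frame below is a made-up stand-in that only reuses the column names from the question (the real data is not shown there), and it assumes the rpart package is installed.

```r
library(rpart)

# Hypothetical stand-in for the asker's data: a numeric response
# with many distinct values and two numeric predictors.
set.seed(1)
varData <- data.frame(
  respVar = runif(1000),
  Var1 = rnorm(1000),
  Var2 = rnorm(1000)
)

# With method = "class", each distinct response value is treated as a class,
# so this count shows how many classes rpart would have to handle.
n_classes <- length(unique(varData$respVar))
n_classes

# Treat the response as a number instead of a label: a regression tree.
fit <- rpart(respVar ~ Var1 + Var2, data = varData, method = "anova")
print(fit$method)
```

If the count is close to the number of rows, the response is probably better treated as a continuous value than as a class label.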

Upvotes: 0

rischan

Reputation: 1585

Yes, I think so.

It's the same error we get when calling the rep function with a huge count, for example:

> x <- rep(0,120000*12000000)
Error in rep(0, 120000 * 1.2e+07) : invalid 'times' argument
In addition: Warning message:
NAs introduced by coercion 

But this is just a guess; I don't know for certain.
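A small base-R sketch of why the count could be the problem, under the assumption (not confirmed in the question) that the response ends up with roughly one class per row. The error message shows rpart evaluating rep(1, numclass^2), and with a class count in the 100k range that square no longer fits in an R integer. The value of numclass below is hypothetical.

```r
# Hypothetical class count: roughly one class per row of a 100k-row data set.
numclass <- 100000

# numclass^2 is computed as a double and is far above the integer limit.
n_cells <- numclass^2
exceeds_int <- n_cells > .Machine$integer.max

# Coercing it to integer gives NA plus a coercion warning (suppressed here),
# which resembles the warning shown in the question.
as_int <- suppressWarnings(as.integer(n_cells))

exceeds_int
is.na(as_int)
```

Newer R versions support long vectors, so the same rep call may fail there with a memory allocation error instead of "invalid 'times' argument".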

Upvotes: 1
