foboi1122

Reputation: 1757

Spark random forest IndexOutOfBoundsException when training

I am attempting to run MLlib's random forest model and am getting an out-of-bounds exception:

15/09/15 01:53:56 INFO scheduler.DAGScheduler: ResultStage 5 (collect at DecisionTree.scala:977) finished in 0.147 s
15/09/15 01:53:56 INFO scheduler.DAGScheduler: Job 5 finished: collect at DecisionTree.scala:977, took 0.161129 s
15/09/15 01:53:57 INFO rdd.MapPartitionsRDD: Removing RDD 4 from persistence list
15/09/15 01:53:57 INFO storage.BlockManager: Removing RDD 4
Traceback (most recent call last):
  File "/root/random_forest/random_forest_spark.py", line 142, in <module>
    main()
  File "/root/random_forest/random_forest_spark.py", line 121, in main
    trainModel(dset)
  File "/root/random_forest/random_forest_spark.py", line 136, in trainModel
    impurity='gini', maxDepth=4, maxBins=32)
  File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 352, in trainClassifier
  File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 270, in _train
  File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 128, in callMLlibFunc
  File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 121, in callJavaFunc
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o47.trainRandomForestModel.
: java.lang.IndexOutOfBoundsException: 1337 not in [0,1337)
        at breeze.linalg.SparseVector$mcD$sp.apply$mcD$sp(SparseVector.scala:74)
        at breeze.linalg.SparseVector$mcD$sp.apply(SparseVector.scala:73)
        at breeze.linalg.SparseVector$mcD$sp.apply(SparseVector.scala:49)
        at breeze.linalg.TensorLike$class.apply$mcID$sp(Tensor.scala:94)
        at breeze.linalg.SparseVector.apply$mcID$sp(SparseVector.scala:49)
        at org.apache.spark.mllib.linalg.Vector$class.apply(Vectors.scala:102)
        at org.apache.spark.mllib.linalg.SparseVector.apply(Vectors.scala:636)
        at org.apache.spark.mllib.tree.DecisionTree$$anonfun$26.apply(DecisionTree.scala:992)
        at org.apache.spark.mllib.tree.DecisionTree$$anonfun$26.apply(DecisionTree.scala:992)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at org.apache.spark.mllib.tree.DecisionTree$.findSplitsBins(DecisionTree.scala:992)
        at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:151)
        at org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:289)
        at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:666)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

I ran the sample Python code here using data/mllib/sample_libsvm_data.txt, which ran correctly. However, when I use my own RDD, I get the error described above. My RDD entries are MLlib LabeledPoint objects, and each labeled point's features are described by an MLlib SparseVector. I am loading the data for the sparse vectors from a scipy CSR matrix.

I didn't see much of a difference between the sample data and my own data, but I did notice that the error always seems to be raised on the last element of my RDD.

Edit: A sample test case with my data trained on a random forest yielded the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o46.trainRandomForestModel.
: java.lang.IndexOutOfBoundsException: 1071 not in [0,1071)

I then tried looking more into my data with the following:

>>> dset = data.collect()
>>> dset[-1].features.size
1721

Each entry is of the following type:

>>> type(dset[-1].features)
<class 'pyspark.mllib.linalg.SparseVector'>

The output of dset[-1] is of the form:

LabeledPoint(0.0, (2286,[44673,64508,65588,122081,306819,306820,382530,401432,465330,465336,505179,512444,512605,517844,526648,595536,595540,615236,628547,629226,810553,938019,1044478,1232743,... ... ...],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,... ... .. ]))

Note that the declared size of the feature vector is the same as the index in the error message.
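
A quick way to check for this programmatically (a debugging sketch, not from my original run; it assumes the RDD of LabeledPoints is named data and that every vector has at least one nonzero entry):

# Flag every LabeledPoint whose largest nonzero index falls outside
# its declared vector size.
def size_mismatch(lp):
    return max(lp.features.indices) >= lp.features.size

bad = data.filter(size_mismatch)
print(bad.count())  # number of points whose indices overflow their declared size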

Upvotes: 0

Views: 2530

Answers (2)

itsajitsharma

Reputation: 21

Adding one more important point to foboi1122's answer: since the RDD contains a collection of LabeledPoints, every LabeledPoint should declare its vector size as (the maximum index across all LabeledPoints in the RDD) + 1. One is added because the vector size is always one more than the largest zero-based index in the vector.

So you cannot have these two LabeledPoints in the same RDD:

LabeledPoint (1.0,(29,[28],[32551.0])),
LabeledPoint (0.0,(12,[11],[18.0]))

Instead, it should be:

LabeledPoint (1.0,(29,[28],[32551.0])),
LabeledPoint (0.0,(29,[11],[18.0]))
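
For example, one way to enforce this is to compute the global maximum index first and rebuild every vector with a shared size. A minimal sketch, assuming an RDD of LabeledPoints named data whose vectors each contain at least one nonzero entry:

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# The shared vector size must be the largest index anywhere in the
# RDD plus one, because indices are zero-based.
vector_size = int(data.map(lambda lp: max(lp.features.indices)).max()) + 1

# Rebuild every LabeledPoint with the same declared size.
fixed = data.map(lambda lp: LabeledPoint(
    lp.label,
    SparseVector(vector_size, lp.features.indices, lp.features.values)))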

Upvotes: 0

foboi1122

Reputation: 1757

I found the reason I was getting these errors, so I am posting it here in case someone else runs into it as well.

tl;dr I had the wrong value stored for SparseVector's size.

My LabeledPoint objects for MLlib hold a label and features, where features should be a SparseVector object. This sparse vector is declared as SparseVector(vector_size, nonzero_indices, data).

However, I accidentally used the number of nonzero values as vector_size. This can be seen in my example output: LabeledPoint(0.0, (2286,[44673,64508, ...

Here I declared my size as 2286, yet even my first index (44673) is larger than the declared array size, which is what caused the exception.

Changing 2286 to the true, non-sparse array size solved the problem.
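
For anyone loading from a scipy CSR matrix as I was, here is a minimal sketch of the conversion (the variable names csr, labels, and sc are illustrative, not from my actual script):

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# The vector size must be the full feature dimension csr.shape[1],
# NOT the number of nonzero entries in each row.
num_features = csr.shape[1]
points = [LabeledPoint(labels[i],
                       SparseVector(num_features,
                                    csr.getrow(i).indices,
                                    csr.getrow(i).data))
          for i in range(csr.shape[0])]
data = sc.parallelize(points)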

Upvotes: 5
