Reputation: 5340
I'm learning PySpark and MLlib.
After predicting on the test data with a Random Forest model, I assign the result to a variable called 'predictions', which is an RDD.
If I call predictions.count() or predictions.collect(), it fails with the following exception.
Can you please share your thoughts? I have already spent quite some time on this but can't find what is missing.
predictions = predict(training_data, test_data)
File "/mp5/part_d_poc.py", line 36, in predict
print(predictions.count())
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1055, in count
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1046, in sum
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 917, in fold
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 (TID 28, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 7
I constructed the training data in the following way.
raw_training_data.map(lambda row: LabeledPoint(row.split(',')[-1], Vectors.dense(row.split(',')[0:-1])))
Upvotes: 2
Views: 61
Reputation: 876
This error seems to occur when there's a mismatch between the schema and the data. Please refer to these -
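As a rough way to check for such a mismatch, you could compare the number of features per row in the training and test inputs before building any LabeledPoint. The sketch below is plain Python and assumes comma-separated rows with the label in the last column, as in the question; the helper names `feature_counts` and `parse_row` and the sample rows are hypothetical, not part of any Spark API.

```python
def feature_counts(rows):
    """Return the set of distinct feature counts (fields minus the label)."""
    return {len(row.split(',')) - 1 for row in rows}


def parse_row(row, expected_features):
    """Split a CSV row into (label, features) and fail early on a bad width."""
    fields = row.split(',')
    features = [float(x) for x in fields[:-1]]
    if len(features) != expected_features:
        raise ValueError(
            "expected %d features, got %d in row: %r"
            % (expected_features, len(features), row)
        )
    return float(fields[-1]), features


# Hypothetical sample rows: the first test row is missing one feature.
train_rows = ["1.0,2.0,3.0,0", "4.0,5.0,6.0,1"]
test_rows = ["1.0,2.0,0", "4.0,5.0,6.0,1"]

train_widths = feature_counts(train_rows)
test_widths = feature_counts(test_rows)

# More than one width, or widths that differ between the two sets,
# would point to a schema/data mismatch.
mismatch = len(train_widths | test_widths) > 1
```

On an RDD, the same check could be written as something like raw_test_data.map(lambda r: len(r.split(',')) - 1).distinct().collect(), assuming raw_test_data is an RDD of text lines in the same layout as raw_training_data. If the widths differ between the two sets, that would be consistent with the ArrayIndexOutOfBoundsException raised at prediction time.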
Upvotes: 0