Reputation: 93
I have created a DataFrame with my data to run some machine learning experiments. I'm trying to split it into training and test sets using the randomSplit() function, but it throws exceptions whose cause I can't figure out. My code is similar to this:
from pyspark.ml.feature import VectorAssembler

Features = ['A', 'B', 'C', 'D', 'E', 'aVec', 'bVec', 'cVec', 'dVec']
vec = VectorAssembler(inputCols=Features, outputCol='features')
df = vec.transform(df)
df = df.select("features", "Target")
(train, test) = df.randomSplit([0.8, 0.2])
print(df.count())
print(train.count())
print(test.count())
The letters inside 'Features' represent numeric features, and the *Vec elements represent one-hot-encoded vectors (created with pyspark's OneHotEncoder).
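For context, such vector columns are typically produced along these lines (a minimal sketch, not my exact code; the raw column names are hypothetical and Spark 3.x's OneHotEncoder API is assumed):

from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Map the string category to a numeric index (hypothetical raw column 'a').
indexer = StringIndexer(inputCol='a', outputCol='aIdx')
df = indexer.fit(df).transform(df)

# One-hot encode the index into a sparse vector column.
encoder = OneHotEncoder(inputCols=['aIdx'], outputCols=['aVec'])
df = encoder.fit(df).transform(df)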
When Spark reaches print(train.count()) it throws the following exception:
Py4JJavaError: An error occurred while calling o2274.count.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 5 in stage 1521.0 failed 1 times, most recent failure:
Lost task 5.0 in stage 1521.0 (TID 122477, localhost, executor driver):
java.lang.IllegalAccessError: tried to access field
org.apache.spark.sql.execution.BufferedRowIterator.partitionIndex from class
The print on df works fine, so I'm thinking that randomSplit is corrupting my data somehow.
I did a small test: if I remove any one of the one-hot-encoded vectors, it starts to work for some reason (I removed 'aVec', for example, and it worked). The problem does not seem to be tied to a specific column, because I can remove any of them: if I run my code with Features = ['aVec', 'bVec', 'cVec'] or Features = ['bVec', 'cVec', 'dVec'] it works, but not with Features = ['aVec', 'bVec', 'cVec', 'dVec'].
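For reference, the subset test looked roughly like this (a sketch, not my exact code; it assumes df here is the DataFrame from before the VectorAssembler step):

# Try each subset of the encoded columns and count the resulting splits.
for cols in (['aVec', 'bVec', 'cVec'],
             ['bVec', 'cVec', 'dVec'],
             ['aVec', 'bVec', 'cVec', 'dVec']):
    assembler = VectorAssembler(inputCols=cols, outputCol='features')
    parts = assembler.transform(df).select('features', 'Target').randomSplit([0.8, 0.2])
    print(cols, [part.count() for part in parts])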
Is there a reason for this error I'm getting?
Upvotes: 3
Views: 2909
Reputation: 21
I ran into a similar problem recently.
Making the VectorAssembler skip invalid entries in the DataFrame resolved my issue:
df = vec.setHandleInvalid("skip").transform(df)
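Equivalently, assuming Spark 2.4+ (where VectorAssembler gained the handleInvalid parameter), the option can be set at construction time; note that "skip" silently drops the offending rows:

vec = VectorAssembler(inputCols=Features, outputCol='features',
                      handleInvalid='skip')  # drops rows with null/NaN inputs
df = vec.transform(df)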
Upvotes: 1
Reputation: 93
I had the same problem; mine was solved by removing blank values from my data. There were several blank values in one of the input columns. They were not NAs or NULLs, just a single space: " ". That caused the same error you're describing above. I filtered them out using:
raw_data = raw_data.filter('YourColumn != " "')
Hope this helps for you too.
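A slightly more general variant of the same filter (an assumption on my part: it uses trim() to drop any whitespace-only values, not just a single space):

from pyspark.sql import functions as F

# Keep only rows whose column is non-empty after trimming whitespace.
raw_data = raw_data.filter(F.trim(F.col('YourColumn')) != '')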
Upvotes: 2