Ajg

Reputation: 257

PySpark dataframe pipeline throws No plan for MetastoreRelation Error

After preprocessing a PySpark dataframe, I am trying to apply a pipeline to it, but I get the error below:

java.lang.AssertionError: assertion failed: No plan for MetastoreRelation.

What does this error mean, and how can I solve it? My code has become quite large, so I will explain the steps instead:

1. I have 8000 columns and 68k rows in my Spark dataframe. Of the 8k columns, 500 are categorical, and I applied pyspark.ml one-hot encoding to them as a stage in an ml.pipeline:
   encoders2 = [OneHotEncoder(inputCol=c, outputCol="{0}_enc".format(c)) for c in cat_numeric[i:i+2]]
   This is very slow, and even after 3 hours it had not completed. I am using 40 GB of memory on each of 12 nodes.
2. So instead I read 100 columns at a time from the PySpark dataframe, create a pandas dataframe from them, and do the one-hot encoding in pandas. Then I convert the pandas dataframe back into a PySpark dataframe and merge it with the original dataframe.
3. Then I try to apply a pipeline with StringIndexer and OneHotEncoder stages for the categorical string features (there are only 5 of these), followed by an assembler to create 'features' and 'labels'. At this stage I get the error above.
4. Please let me know if my approach is wrong or if I am missing anything. Also let me know if you need more information. Thanks.
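To make steps 1 and 3 concrete, here is a minimal sketch of how the stages are put together. It is not my actual code: the function names, the "_idx" suffix, and the argument names (string_cols, numeric_cat_cols, other_cols) are made up for illustration, and the batching helper only mirrors the cat_numeric[i:i+2] slicing shown above.

```python
def batches(cols, size):
    """Split a list of column names into consecutive chunks of `size`."""
    return [cols[i:i + size] for i in range(0, len(cols), size)]


def encoded_name(col):
    """Output column name for an encoded column, as in the question."""
    return "{0}_enc".format(col)


def build_stages(string_cols, numeric_cat_cols, other_cols):
    """Build StringIndexer -> OneHotEncoder -> VectorAssembler stages.

    All three argument names are hypothetical placeholders for lists of
    column names; adapt them to the real dataframe.
    """
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    # String categoricals are indexed first; the "_idx" suffix is an assumption.
    indexed = ["{0}_idx".format(c) for c in string_cols]
    indexers = [StringIndexer(inputCol=c, outputCol=i)
                for c, i in zip(string_cols, indexed)]

    # Numeric categoricals are encoded directly, indexed strings after indexing.
    to_encode = indexed + list(numeric_cat_cols)
    encoders = [OneHotEncoder(inputCol=c, outputCol=encoded_name(c))
                for c in to_encode]

    assembler = VectorAssembler(
        inputCols=list(other_cols) + [encoded_name(c) for c in to_encode],
        outputCol="features",
    )
    return indexers + encoders + [assembler]
```

The returned list would then be passed to pyspark.ml.Pipeline(stages=...) and fitted on the merged dataframe, which is the point where the error appears.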

Upvotes: 0

Views: 1505

Answers (1)

Ajg

Reputation: 257

This error was caused by the order in which the two PySpark dataframes were joined. Changing the join order from, say, a.join(b) to b.join(a) made it work.
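As a rough sketch of the change, assuming the two dataframes share a key column: the helper name join_in_order, the key name "row_id", and the inner join type are all placeholders I am using for illustration, not the exact code from my job.

```python
def join_in_order(original_df, encoded_df, key="row_id", encoded_first=True):
    """Join the two dataframes, choosing which one is on the left side.

    `key` and the inner join are assumptions for this sketch; any object
    with a DataFrame-style join(other, on=..., how=...) method works.
    """
    if encoded_first:
        # b.join(a): the dataframe rebuilt from pandas is the left side.
        return encoded_df.join(original_df, on=key, how="inner")
    # a.join(b): the original dataframe is the left side.
    return original_df.join(encoded_df, on=key, how="inner")
```

Which side has to come first may depend on how each dataframe was created, so if one order fails with the assertion error it is worth trying the other.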

Upvotes: 1
