Reputation: 67
I have a PySpark pipeline (containing imputations and a machine learning model) and a pandas dataframe. Can I apply the pipeline to this pandas dataframe without converting it to a PySpark dataframe? If that is not possible, how do I efficiently use the PySpark pipeline to generate predictions on the pandas dataframe?
Upvotes: 1
Views: 826
Reputation: 660
I'm afraid that is not possible. PySpark models do not work the same way as other Python libraries such as scikit-learn. Spark Pipelines and ML models require a Spark DataFrame as input, so you would first convert the pandas dataframe (for example with `spark.createDataFrame`). Also, you usually use Spark (PySpark) because you want to distribute your computation. pandas runs operations on a single machine, whereas PySpark can run across multiple machines. If you are working on a machine learning application with larger datasets, PySpark is often the better fit and can process operations many times faster than pandas (figures of up to 100x are sometimes quoted, but this depends on the workload and the cluster).
You can read in the documentation that you need a Spark SQL DataFrame:
https://spark.apache.org/docs/latest/ml-pipeline.html#main-concepts-in-pipelines
https://spark.apache.org/docs/latest/ml-pipeline.html#dataframe
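For the second part of the question, the usual route is a round trip: pandas to Spark, apply the fitted pipeline, then back to pandas. Below is a minimal sketch of that idea, not a definitive implementation. The helper name `predict_on_pandas` and its `columns` parameter are made up for illustration; it assumes `spark` is an active `SparkSession` and `fitted_pipeline` is an already fitted `PipelineModel` (or any object with a `transform` method that takes a Spark DataFrame).

```python
def predict_on_pandas(spark, fitted_pipeline, pandas_df, columns=None):
    """Score a pandas dataframe with a fitted Spark ML pipeline.

    Hypothetical helper: the name and the `columns` argument are
    illustrative and are not part of the PySpark API.
    """
    # Spark ML only accepts Spark DataFrames, so convert first.
    spark_df = spark.createDataFrame(pandas_df)

    # Run every stage of the fitted pipeline (imputers, model, ...).
    scored = fitted_pipeline.transform(spark_df)

    # Optionally keep only some columns (e.g. "prediction") so less
    # data has to be collected back to the driver.
    if columns is not None:
        scored = scored.select(*columns)

    # Collect the result back into a pandas dataframe on the driver.
    return scored.toPandas()


# Example usage (assumes pyspark is installed; the path is hypothetical):
# from pyspark.sql import SparkSession
# from pyspark.ml import PipelineModel
# spark = SparkSession.builder.master("local[*]").getOrCreate()
# model = PipelineModel.load("path/to/saved_pipeline")
# preds = predict_on_pandas(spark, model, pdf, columns=["prediction"])
```

If the pandas dataframe is large, enabling Arrow (`spark.sql.execution.arrow.pyspark.enabled` set to `true` in Spark 3.x) may speed up both conversions, though whether it helps depends on your column types and Spark version.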
Hope this helps!
Upvotes: 1