Reputation: 67

Applying Pyspark pipeline on pandas dataframe

I have a PySpark pipeline (containing imputations and a machine learning model) and a pandas DataFrame. Can I apply the pipeline to this pandas DataFrame without converting it to a PySpark DataFrame? If that is not possible, how do I efficiently use the PySpark pipeline to generate predictions on the pandas DataFrame?

Upvotes: 1

Answers (1)

JohnDoe_Scientist

Reputation: 660

I'm afraid that is not possible. PySpark models do not work the same way as other Python libraries such as scikit-learn. You need a Spark DataFrame for Spark pipelines and ML models. Also, you usually use Spark [PySpark] because you want to distribute your computation. [pandas runs operations on a single machine, whereas PySpark can run across multiple machines. If you are building a machine learning application on larger datasets, PySpark is often the better fit and can process operations many times faster than pandas (figures of around 100x are sometimes quoted), though the actual gain depends on the data size and the cluster.]
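If converting is acceptable, the round trip is short. Below is a minimal sketch, not a definitive implementation: it assumes you already have a fitted `PipelineModel` and an active `SparkSession`. The helper name `predict_on_pandas` and its `output_cols` parameter are my own, and the path in the usage comment is a placeholder.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Optional, Sequence

if TYPE_CHECKING:  # imported only for the type hints below
    import pandas as pd
    from pyspark.ml import PipelineModel
    from pyspark.sql import SparkSession


def predict_on_pandas(
    spark: SparkSession,
    model: PipelineModel,
    pdf: pd.DataFrame,
    output_cols: Optional[Sequence[str]] = None,
) -> pd.DataFrame:
    """Score a pandas DataFrame with a fitted Spark pipeline.

    Hypothetical helper: converts pandas -> Spark, runs the pipeline,
    and converts the result back to pandas.
    """
    sdf = spark.createDataFrame(pdf)   # pandas -> Spark DataFrame
    scored = model.transform(sdf)      # runs every fitted stage in order
    if output_cols is not None:        # optionally keep only some columns
        scored = scored.select(*output_cols)
    return scored.toPandas()           # Spark -> pandas


# Hypothetical usage (path and column name are placeholders):
#   from pyspark.ml import PipelineModel
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   model = PipelineModel.load("path/to/fitted_pipeline")
#   preds = predict_on_pandas(spark, model, pandas_df, output_cols=["prediction"])
```

On Spark 3.x, setting `spark.sql.execution.arrow.pyspark.enabled` to `true` (with `pyarrow` installed) usually makes both conversions noticeably faster. Keep in mind that `toPandas()` collects everything to the driver, so this only suits data that fits in memory there.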

The Spark ML documentation states that pipelines operate on a Spark [SQL] DataFrame.

Hope this helps!

Upvotes: 1
