Gregory Vial

Reputation: 11

How to parallelize model training with PySpark

I have a large Spark DataFrame (it doesn't fit in memory) that contains data for multiple devices.

For each device there are multiple rows, and the columns contain features and a target. Here is what it looks like with dummy data:

| DeviceID | Feature 1 | Feature 2 | Target |
|----------|-----------|-----------|--------|
|        A |         1 |         2 |      3 |
|        A |         2 |         4 |      6 |
|        B |         1 |         2 |      3 |
|        B |         2 |         4 |      6 |
|        C |         1 |         2 |      3 |
|        C |         2 |         4 |      6 |

I want to fit one linear model for each device, and I would like to parallelize this training to speed up computation.

With the data above, I would end up with 3 models: one for device A, one for device B, one for device C.

Can this be achieved with PySpark?

Using joblib Parallel/delayed doesn't work because spark dataframes can't be serialized.

Upvotes: 0

Views: 2083

Answers (4)

Gregory Vial

Reputation: 11

I ended up using applyInPandas, which does a nice job of parallelizing a task based on a group-by of one or several columns of the original DataFrame.
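For reference, a minimal sketch of what this can look like. The function name `fit_device_model` and the coefficient schema are my own illustration, not from the answer; the Spark wiring is shown in comments since it needs a live session, while the training function itself runs on plain pandas:

```python
import numpy as np
import pandas as pd

# Training function that applyInPandas would call once per DeviceID group.
# It receives the group's rows as a plain pandas DataFrame and must return
# a pandas DataFrame (here: one row of fitted coefficients per device).
def fit_device_model(pdf: pd.DataFrame) -> pd.DataFrame:
    X = pdf[["Feature 1", "Feature 2"]].to_numpy(dtype=float)
    y = pdf["Target"].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares linear fit
    return pd.DataFrame({
        "DeviceID": [pdf["DeviceID"].iloc[0]],
        "coef_f1": [coef[0]],
        "coef_f2": [coef[1]],
    })

# With a live SparkSession the wiring would look roughly like (untested sketch):
# result = (df.groupBy("DeviceID")
#             .applyInPandas(fit_device_model,
#                            schema="DeviceID string, coef_f1 double, coef_f2 double"))

# The same function can be exercised locally with pandas alone:
device_a = pd.DataFrame({
    "DeviceID": ["A", "A"],
    "Feature 1": [1.0, 2.0],
    "Feature 2": [2.0, 4.0],
    "Target": [3.0, 6.0],
})
print(fit_device_model(device_a))
```

Spark serializes each group to Arrow, calls the function on a worker, and stitches the returned frames back into one DataFrame, so the per-device fits run in parallel across executors.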

Upvotes: 0

Ritwik

Reputation: 1

  1. Initialize a SparkContext.

  2. Create individual DataFrames corresponding to each of your DeviceIDs (use the filter method) and put them in a list like category_data_list = [category_A_data, category_B_data, category_C_data].

  3. Define a function to train a model on each of these categories of data.

  4. Use the parallelize function and pass it category_data_list.

  5. Repeat the above steps for predictions as well.
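A rough sketch of these steps, with one caveat: as the question notes, Spark DataFrames can't be serialized into an RDD, so each per-device subset would have to be collected to pandas (via toPandas) before being passed to parallelize. The helper name train_on_category is my own illustration:

```python
import numpy as np
import pandas as pd

# Step 3: per-category training function; fits a least-squares linear model
# on one device's pandas DataFrame and returns the coefficients.
def train_on_category(pdf: pd.DataFrame) -> np.ndarray:
    X = pdf[["Feature 1", "Feature 2"]].to_numpy(dtype=float)
    y = pdf["Target"].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Steps 1, 2 and 4 need a live SparkContext; the wiring would look roughly like
# (untested sketch -- note the toPandas call, since Spark DataFrames themselves
# cannot be shipped through parallelize):
# sc = SparkSession.builder.getOrCreate().sparkContext
# category_data_list = [df.filter(df.DeviceID == d).toPandas() for d in ["A", "B", "C"]]
# models = sc.parallelize(category_data_list).map(train_on_category).collect()

# Local demonstration on one category:
category_A_data = pd.DataFrame({
    "Feature 1": [1.0, 2.0],
    "Feature 2": [2.0, 4.0],
    "Target": [3.0, 6.0],
})
print(train_on_category(category_A_data))
```

This only works if each device's subset fits in driver memory after filtering, which is a stronger assumption than the grouped pandas UDF approach makes.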

Upvotes: 0

nsenno

Reputation: 19

The kinds of model training mentioned in the other answers apply when you want to train one big model on all of your data. What you are trying to do is train many small models. To do that, group the DataFrame by DeviceID and pass each sub-dataset to a pandas UDF for training.

Here is a blog post that describes how to do this at scale. Ignore the part about prophet, you can use any ML model you like that works on pandas DataFrames. Note that Spark parallelizes the training so each dataset is trained on a separate executor.

Documentation on PandasUDF

Upvotes: 0

s510

Reputation: 2822

Yes, this can be achieved in PySpark. Some ML algorithms already work out of the box, i.e., they are already parallelized across the worker nodes. See the available algorithms here: https://spark.apache.org/docs/latest/ml-classification-regression.html

What you need to do is split the data by DeviceID and train a separate linear model on each subset. If you would like to parallelize the training of the 3 linear models, open 3 different Spark sessions and train each on its device-specific data.

Upvotes: 0
