How to use a Pandas UDF in a class

Question

I am trying to figure out how to use self in PandasUDF.GroupBy.Apply inside a class method in Python, and also how to pass arguments to it. I have tried many different approaches but could not make any of them work. I also searched extensively for an example of a Pandas UDF used inside a class with self and arguments, but could not find one. I know how to do all of the above with Pandas.GroupBy.Apply.

The only way I could make it work was by declaring it as a static method:

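import pickle

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, FloatType, BinaryType
from xgboost import XGBRegressor

class Train: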
    return_type = StructType([
        StructField("div_nbr", FloatType()),
        StructField("store_nbr", FloatType()),
        StructField("model_str", BinaryType())
    ])
    function_type = PandasUDFType.GROUPED_MAP

    def __init__(self):
       ............

    def run_train(self):
        output = sp_df.groupby(['A', 'B']).apply(self.model_train)
        output.show(10)

    @staticmethod
    @pandas_udf(return_type, function_type)
    def model_train(pd_df):
        features_name = ['days_into_year', 'months_into_year', 'minutes_into_day', 'hour_of_day', 'recency']

        X = pd_df[features_name].copy()
        Y = pd.DataFrame(pd_df['trans_type_value']).copy()

        estimator_1 = XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=300, verbosity=1,
                                   objective='reg:squarederror', booster='gbtree', n_jobs=-1, gamma=0,
                                   min_child_weight=5, max_delta_step=0, subsample=0.6, colsample_bytree=0.8,
                                   colsample_bylevel=1, colsample_bynode=1, reg_alpha=0, reg_lambda=1,
                                   scale_pos_weight=1, base_score=0.5, random_state=1234, missing=None,
                                   importance_type='gain')
        estimator_1.fit(X, Y)
        df_to_return = pd_df[['div_nbr', 'store_nbr']].drop_duplicates().copy()
        df_to_return['model_str'] = pickle.dumps(estimator_1)

        return df_to_return

What I would actually like to achieve is to declare return_type, function_type, and features_name in __init__(), use them in the Pandas UDF, and also pass parameters to the function when calling PandasUDF.GroupBy.Apply.

If anyone could help me out, I would highly appreciate it. I am fairly new to PySpark.


Answers (1)

Background

The Solutions
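
One commonly used pattern, sketched here under assumptions rather than taken from the original answer, is to build the grouped-map function as a closure inside a method. The instance attributes are first copied into local variables, so the function shipped to the executors captures only those values and never self. The constructor parameters, the make_group_fn name, and the estimator_factory callable are hypothetical; the sketch also assumes the grouping columns are floats, as in the question's return_type, and that the same columns are used for grouping and for the output.

```python
import pickle


class Train:
    def __init__(self, features_name, label_name, group_cols, estimator_factory):
        # All four parameters are illustrative; estimator_factory is assumed
        # to be a picklable callable returning an unfitted model with .fit().
        self.features_name = list(features_name)
        self.label_name = label_name
        self.group_cols = list(group_cols)
        self.estimator_factory = estimator_factory

    def make_group_fn(self):
        # Copy attributes into locals so the closure below captures plain
        # values instead of self (Spark then does not pickle the instance).
        features_name = self.features_name
        label_name = self.label_name
        group_cols = self.group_cols
        estimator_factory = self.estimator_factory

        def model_train(pd_df):
            # pd_df is the pandas DataFrame holding one group.
            estimator = estimator_factory()
            estimator.fit(pd_df[features_name], pd_df[label_name])
            out = pd_df[group_cols].drop_duplicates().copy()
            out["model_str"] = pickle.dumps(estimator)
            return out

        return model_train

    def run_train(self, sp_df):
        # Imported lazily so the class can be built without Spark installed.
        from pyspark.sql.functions import pandas_udf, PandasUDFType
        from pyspark.sql.types import (
            StructType, StructField, FloatType, BinaryType,
        )

        # Assumption: every grouping column is a float, as in the question.
        return_type = StructType(
            [StructField(c, FloatType()) for c in self.group_cols]
            + [StructField("model_str", BinaryType())]
        )
        train_udf = pandas_udf(
            self.make_group_fn(),
            returnType=return_type,
            functionType=PandasUDFType.GROUPED_MAP,
        )
        return sp_df.groupby(self.group_cols).apply(train_udf)
```

With this layout, something like Train(features, 'trans_type_value', ['div_nbr', 'store_nbr'], lambda: XGBRegressor(max_depth=3)).run_train(sp_df) would return the per-group models. Any extra parameter can be passed the same way: store it on the instance and copy it into a local before defining model_train.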

Not in a Class
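
Without a class, the same idea can be sketched as a plain function factory: the parameters are bound when the factory is called, and the returned function takes only the group's DataFrame. This is an illustration under assumptions, not the original answer's code. The names make_model_train, build_schema, and run_train are hypothetical, the key columns are assumed to be floats, and the sketch uses GroupedData.applyInPandas, which requires Spark 3.0 or later.

```python
import pickle


def make_model_train(features_name, label_name, key_cols, estimator_factory):
    """Return a grouped-map function with its parameters already bound."""

    def model_train(pd_df):
        # pd_df is the pandas DataFrame holding one group.
        estimator = estimator_factory()
        estimator.fit(pd_df[features_name], pd_df[label_name])
        out = pd_df[key_cols].drop_duplicates().copy()
        out["model_str"] = pickle.dumps(estimator)
        return out

    return model_train


def build_schema(key_cols):
    # DDL-style schema string; assumes every key column is a float.
    return ", ".join(f"{c} float" for c in key_cols) + ", model_str binary"


def run_train(sp_df, features_name, label_name, key_cols, estimator_factory):
    fn = make_model_train(features_name, label_name, key_cols, estimator_factory)
    # applyInPandas takes the plain Python function plus the output schema.
    return sp_df.groupby(key_cols).applyInPandas(fn, schema=build_schema(key_cols))
```

On Spark 2.x, where applyInPandas is not available, the function returned by make_model_train could instead be wrapped with pandas_udf and PandasUDFType.GROUPED_MAP and passed to groupby(...).apply(...), as in the question.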
