Mimi

Reputation: 3

Optimizing a PySpark UDF on large data

I am trying to optimize this code, which creates a dummy column on a PySpark DataFrame that is 1 when the column's value is in categories and 0 otherwise.

On 100K rows it takes about 30 seconds to run. In my case I have around 20M rows, which will take a lot of time.

from pyspark.sql.functions import col, UserDefinedFunction
from pyspark.sql.types import IntegerType

def create_dummy(dframe, col_name, top_name, categories, **options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
    # Row-by-row Python UDF: 1 if the value is in categories, else 0
    udf = UserDefinedFunction(lambda x: 1 if x in categories else 0, IntegerType())
    dframe = dframe.withColumn(str(top_name), udf(col(col_name))).cache()
    dframe = dframe.select(lst_tmp_col + [str(top_name)])
    return dframe

In other words, how can I optimize this function and cut down the total time given the volume of my data? And how can I make sure that this function does not iterate over my data row by row?

Appreciate your suggestions. Thanks

Upvotes: 0

Views: 421

Answers (1)

mck

Reputation: 42332

You don't need a UDF for encoding the categories. You can use isin:

import pyspark.sql.functions as F

def create_dummy(dframe, col_name, top_name, categories, **options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
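    # isin produces a boolean Column evaluated natively in the JVM; casting to int gives the 0/1 dummy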
    dframe = dframe.withColumn(str(top_name), F.col(col_name).isin(categories).cast("int")).cache()
    dframe = dframe.select(lst_tmp_col + [str(top_name)])
    return dframe 
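
For reference, a minimal usage sketch (the DataFrame, column names, and category list below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: an id column and a categorical "color" column
df = spark.createDataFrame(
    [(1, "red"), (2, "blue"), (3, "green")],
    ["id", "color"],
)

# Flag rows whose color is in the category list, keeping "id" alongside the dummy
result = create_dummy(df, "color", "is_warm", ["red", "orange"], lst_tmp_col=["id"])
result.show()
# +---+-------+
# | id|is_warm|
# +---+-------+
# |  1|      1|
# |  2|      0|
# |  3|      0|
# +---+-------+

Because isin and cast are built-in Column expressions, Spark evaluates them inside the JVM in a single projection, so there is no per-row round trip to the Python workers as there is with a UDF.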

Upvotes: 1
