Mimi

Reputation: 3

Optimizing a PySpark UDF on large data

I am trying to optimize this code, which creates a dummy column on a PySpark DataFrame that is 1 when the column's value is in categories and 0 otherwise.

On 100K rows it takes about 30 seconds to run. In my case I have around 20M rows, which will take a lot of time.

from pyspark.sql.functions import col, UserDefinedFunction
from pyspark.sql.types import IntegerType

def create_dummy(dframe, col_name, top_name, categories, **options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
    # Row-by-row Python UDF: 1 if the value is in categories, else 0
    udf = UserDefinedFunction(lambda x: 1 if x in categories else 0, IntegerType())
    dframe = dframe.withColumn(str(top_name), udf(col(col_name))).cache()
    dframe = dframe.select(lst_tmp_col + [str(top_name)])
    return dframe

In other words, how can I optimize this function and cut down the total time given the volume of my data? And how can I make sure that this function does not iterate over my data row by row?

Appreciate your suggestions. Thanks

Upvotes: 0

Views: 421

Answers (1)

mck

Reputation: 42332

You don't need a UDF for encoding the categories. You can use isin:

import pyspark.sql.functions as F

def create_dummy(dframe, col_name, top_name, categories, **options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
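    # isin produces a boolean Column evaluated natively in the JVM; casting to int gives the 0/1 dummy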
    dframe = dframe.withColumn(str(top_name), F.col(col_name).isin(categories).cast("int")).cache()
    dframe = dframe.select(lst_tmp_col + [str(top_name)])
    return dframe 
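
For reference, a minimal usage sketch (the DataFrame, column names, and category list below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: an id column and a categorical "color" column
df = spark.createDataFrame(
    [(1, "red"), (2, "blue"), (3, "green")],
    ["id", "color"],
)

# Flag rows whose color is in the category list, keeping "id" alongside the dummy
result = create_dummy(df, "color", "is_warm", ["red", "orange"], lst_tmp_col=["id"])
result.show()
# +---+-------+
# | id|is_warm|
# +---+-------+
# |  1|      1|
# |  2|      0|
# |  3|      0|
# +---+-------+

Because isin and cast are built-in Column expressions, Spark evaluates them inside the JVM in a single projection, so there is no per-row round trip to the Python workers as there is with a UDF.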

Upvotes: 1
