Reputation: 3
I am trying to optimize this code, which creates a dummy column indicating whether a PySpark DataFrame column's value is in categories.
On 100K rows it takes about 30 seconds to run. My data has around 20M rows, so it will take far longer.
from pyspark.sql.functions import UserDefinedFunction, col
from pyspark.sql.types import IntegerType

def create_dummy(dframe, col_name, top_name, categories, **options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
    udf = UserDefinedFunction(lambda x: 1 if x in categories else 0, IntegerType())
    dframe = dframe.withColumn(str(top_name), udf(col(col_name))).cache()
    dframe = dframe.select(lst_tmp_col + [str(top_name)])
    return dframe
In other words, how do I optimize this function and cut the total runtime down given the volume of my data? And how can I make sure this function does not iterate over my data row by row?
Appreciate your suggestions. Thanks
Upvotes: 0
Views: 421
Reputation: 42332
You don't need a UDF to encode the categories. A Python UDF forces Spark to ship every row out to a Python worker and back, which is what makes it slow at 20M rows. You can use isin instead, which is a native Column expression evaluated directly on the JVM:
import pyspark.sql.functions as F

def create_dummy(dframe, col_name, top_name, categories, **options):
    lst_tmp_col = []
    if 'lst_tmp_col' in options:
        lst_tmp_col = options["lst_tmp_col"]
    dframe = dframe.withColumn(str(top_name), F.col(col_name).isin(categories).cast("int")).cache()
    dframe = dframe.select(lst_tmp_col + [str(top_name)])
    return dframe
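For example, here is a minimal usage sketch of the function above (the DataFrame, column name, and category values are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("apple",), ("banana",), ("cherry",)],
    ["fruit"],
)

# Keep the original column via lst_tmp_col and add the 0/1 dummy column.
result = create_dummy(df, "fruit", "is_selected", ["apple", "cherry"],
                      lst_tmp_col=["fruit"])
result.show()
# +------+-----------+
# | fruit|is_selected|
# +------+-----------+
# | apple|          1|
# |banana|          0|
# |cherry|          1|
# +------+-----------+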
Upvotes: 1