Sarvavyapi

Reputation: 850

Replacement for Spark's CASE WHEN THEN

I am new to Spark and am trying to optimize code written by another developer. The scenario is as follows:

  1. There is a list of dictionaries, each with three key-value pairs: source:value, target:value and column:value.
  2. A CASE WHEN THEN statement is generated from those key-value pairs. For instance, the list of dictionaries is as follows:
values = [{'target': 'Unknown', 'source': '', 'column': 'gender'}, 
{'target': 'F', 'source': '0', 'column': 'gender'}, 
{'target': 'M', 'source': '1', 'column': 'gender'}, 
{'target': 'F', 'source': 'F', 'column': 'gender'}, 
{'target': 'F', 'source': 'Fe', 'column': 'gender'}]
  3. The following code generates the CASE WHEN THEN statement that follows.
from pyspark.sql import functions as F

op = None
for value in values:
    source_value = value.get("source")
    # The first mapping starts the chain with F.when(); later mappings extend it with .when()
    op = (F.when if op is None else op.when)(df[value.get("column")] == source_value, value.get("target"))
Column<'CASE WHEN (gender = ) THEN Unknown 
WHEN (gender = 0) THEN F 
WHEN (gender = 1) THEN M 
WHEN (gender = F) THEN F 
WHEN (gender = Fe) THEN F END'>
  4. This CASE WHEN THEN is then used to select data from a dataframe.
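The question does not show that selection step, so the following is only a guess at typical usage: applying the chained op Column built in the loop above to df with select (withColumn would work the same way).

# Hypothetical application of the chained `op` Column (this step is not shown in the question)
df_mapped = df.select(op.alias("gender"))
df_mapped.show()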

Question: Is the usage of CASE WHEN THEN valid here (i.e., is it optimized)? Some of the generated CASE WHEN statements are very lengthy (1000+ branches). Is there a better way to rewrite the code (regex, perhaps)?

I looked at the questions below, but they were not relevant to my case: CASE WHEN ... THEN and SPARK SQL - case when then.

Thanks.

Upvotes: 2

Views: 539

Answers (1)

Yosi Dahari

Reputation: 6999

Two alternatives (see the sketches below):

  • Use a UDF, inside which you can look each source value up in a plain Python dictionary
  • Build a lookup table from the mappings and perform a broadcast join
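A hedged sketch of the UDF alternative, assuming the values list and a DataFrame shaped like the one in the question; the names gender_map, map_gender and the toy df are my own:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("",), ("0",), ("1",), ("F",), ("Fe",)], ["gender"])

# Collapse the mapping list into a plain dict: source value -> target value
gender_map = {v["source"]: v["target"] for v in values if v["column"] == "gender"}

@F.udf(returnType=StringType())
def map_gender(value):
    # None (NULL) when the source value has no mapping, matching a CASE WHEN with no ELSE
    return gender_map.get(value)

df_udf = df.withColumn("gender", map_gender("gender"))

And a sketch of the broadcast-join alternative, reusing spark, df and values from above; the lookup name is mine:

# Turn the mapping list into a small lookup table and broadcast it to every executor
lookup = spark.createDataFrame(
    [(v["source"], v["target"]) for v in values if v["column"] == "gender"],
    ["source", "target"],
)

df_joined = (
    df.join(F.broadcast(lookup), df["gender"] == lookup["source"], "left")
      .select(lookup["target"].alias("gender"))  # NULL where no mapping matched
)

Note that a Python UDF serializes rows out to the Python workers, while the broadcast join stays entirely in the JVM, which is part of why comparing the execution plans and job durations matters.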

The way to know which is better is by examining the execution plan, job duration and total shuffle.

Upvotes: 1
