Reputation: 850
I am new to Spark and am trying to optimize code written by another developer. The scenario is as follows:
values = [
    {'target': 'Unknown', 'source': '', 'column': 'gender'},
    {'target': 'F', 'source': '0', 'column': 'gender'},
    {'target': 'M', 'source': '1', 'column': 'gender'},
    {'target': 'F', 'source': 'F', 'column': 'gender'},
    {'target': 'F', 'source': 'Fe', 'column': 'gender'},
]
# "op" starts out as a pyspark.sql.functions.when(...) expression;
# each iteration chains one more branch onto it
for value in values:
    source_value = value.get("source")
    column = value.get("column")
    op = op.when(df[column] == source_value, value.get("target"))
Printing op gives:
Column<'CASE WHEN (gender = ) THEN Unknown
WHEN (gender = 0) THEN F
WHEN (gender = 1) THEN M
WHEN (gender = F) THEN F
WHEN (gender = Fe) THEN F END'>
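For reference, a minimal self-contained version of what the code does (the initialization of op is omitted above; this reconstruction assumes the chain is seeded from the first mapping with F.when, and that unmatched values stay NULL, as in the printed expression):

from functools import reduce
from pyspark.sql import functions as F

# seed the chain with the first mapping, then fold the remaining ones on with .when()
first, *rest = values
op = F.when(df[first['column']] == first['source'], first['target'])
op = reduce(lambda acc, v: acc.when(df[v['column']] == v['source'], v['target']), rest, op)
df = df.withColumn('gender', op)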
Question: Is the usage of CASE WHEN ... THEN valid here, i.e., is it optimized? Some of the CASE WHEN chains are very long (1000+ branches). Is there a better way to write this code (regex, perhaps)?
I looked at the questions below, but they were not relevant to my case: CASE WHEN ... THEN and SPARK SQL - case when then.
Thanks.
Upvotes: 2
Views: 539
Reputation: 6999
Two alternatives:
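The first keeps everything as a single expression but replaces the long CASE WHEN chain with a literal map lookup. A minimal sketch, assuming all entries map the same column ('gender') and reusing the values list from the question:

from itertools import chain
from pyspark.sql import functions as F

# build a literal MAP<source, target> column and index it with the gender column
mapping = {v['source']: v['target'] for v in values}
mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping.items())])
df = df.withColumn('gender', mapping_expr[F.col('gender')])

The second turns the lookup into a join against a small mapping DataFrame, which Spark can broadcast so the big table is not shuffled. A sketch, assuming spark is an active SparkSession:

from pyspark.sql import functions as F

# one row per (source, target) pair; unmatched rows keep their original value
mapping_df = spark.createDataFrame(
    [(v['source'], v['target']) for v in values], ['source', 'target'])
df = (df.join(F.broadcast(mapping_df), df['gender'] == mapping_df['source'], 'left')
        .withColumn('gender', F.coalesce('target', 'gender'))
        .drop('source', 'target'))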
The way to know which is better is to examine the execution plan, the job duration, and the total shuffle.
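For example, with df_case and df_join as hypothetical names for the results of the two variants above:

# compare the physical plans; the Spark UI shows shuffle read/write per stage
df_case.explain()
df_join.explain()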
Upvotes: 1