Reputation: 223
I have to apply certain functions to multiple columns in a PySpark DataFrame. Below is my code:
from pyspark.sql.functions import regexp_replace

finaldf = df.withColumn('phone_number', regexp_replace("phone_number", "[^0-9]", ""))\
    .withColumn('account_id', regexp_replace("account_id", "[^0-9]", ""))\
    .withColumn('credit_card_limit', regexp_replace("credit_card_limit", "[^0-9]", ""))\
    .withColumn('credit_card_number', regexp_replace("credit_card_number", "[^0-9]", ""))\
    .withColumn('full_name', regexp_replace("full_name", "[^a-zA-Z ]", ""))\
    .withColumn('transaction_code', regexp_replace("transaction_code", "[^a-zA-Z]", ""))\
    .withColumn('shop', regexp_replace("shop", "[^a-zA-Z ]", ""))
finaldf = finaldf.filter(finaldf.account_id.isNotNull())\
    .filter(finaldf.phone_number.isNotNull())\
    .filter(finaldf.credit_card_number.isNotNull())\
    .filter(finaldf.credit_card_limit.isNotNull())\
    .filter(finaldf.transaction_code.isNotNull())\
    .filter(finaldf.amount.isNotNull())
As you can see, there is a lot of redundant code, which also makes the program longer. I have also learnt that Spark UDFs are not efficient.
Is there a way to optimize this code? Please let me know. Thanks a lot!
Upvotes: 1
Views: 617
Reputation: 8410
For multiple filters, you can do this:
filter_cols = ['account_id', 'phone_number', 'credit_card_number', 'credit_card_limit', 'transaction_code', 'amount']
finaldf = finaldf.filter(' and '.join([x + ' is not null' for x in filter_cols]))
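The repeated regexp_replace calls can be collapsed in a similar way. A minimal sketch (the column-to-pattern mapping below simply mirrors the columns and regexes from your question) is to drive withColumn from a dict:

from pyspark.sql.functions import regexp_replace

# Map each column to the pattern of characters to strip (patterns copied from the question)
replace_patterns = {
    'phone_number': '[^0-9]',
    'account_id': '[^0-9]',
    'credit_card_limit': '[^0-9]',
    'credit_card_number': '[^0-9]',
    'full_name': '[^a-zA-Z ]',
    'transaction_code': '[^a-zA-Z]',
    'shop': '[^a-zA-Z ]',
}

finaldf = df
for col_name, pattern in replace_patterns.items():
    # regexp_replace removes every character matching the pattern
    finaldf = finaldf.withColumn(col_name, regexp_replace(col_name, pattern, ''))

This stays with built-in column functions (no UDFs), so Spark can still optimize the whole plan.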
Upvotes: 1