Reputation: 127
How can I pass a list of columns (multiple columns) instead of a single column in PySpark with this command:
new_df = new_df.filter(new_df.color.isin(*filter_list) == False)
For example, I used this code to remove garbage values (#, $) from a single column:
filter_list = ['##', '$']
new_df = new_df.filter(new_df.color.isin(*filter_list) == False)
In this example, 'color' is the column.
But I want to remove rows with garbage values (#, ##, $, $$$), including repeated occurrences, across multiple columns.
Sample Input:-
id name Salary
# Yogita 3000
2 Bhavana 5000
$$ ### 7000
%$4# Neha $$$$
Sample Output:-
id name Salary
2 Bhavana 5000
Can anybody help me?
Thanks in advance,
Yogita
Upvotes: 0
Views: 274
Reputation: 2696
Here is an answer using a user-defined function:
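from functools import reduce  # reduce is not a builtin in Python 3
from itertools import chain
import pyspark.sql.functions as f  # the udf call below needs this alias
from pyspark.sql.types import BooleanType

filter_list = ['#', '##', '$', '$$$']

def filterfn(*x):
    # One boolean per (cell, token) pair: True when the token is not a substring of the cell.
    # str() guards against non-string columns; 'token' avoids shadowing the builtin filter.
    booleans = list(chain(*[[token not in str(elt) for token in filter_list] for elt in x]))
    # Keep the row only if every check passed.
    return reduce(lambda a, b: a and b, booleans, True)

filter_udf = f.udf(filterfn, BooleanType())
new_df.filter(filter_udf(*[col for col in new_df.columns])).show(10)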
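The row predicate can be sanity-checked without a Spark session by running the same logic in plain Python on the sample rows from the question. This is only a sketch: `keep_row` and `sample_rows` are names made up here for illustration, and all values are assumed to be strings, as the sample data suggests.

```python
# Garbage tokens from the answer above; a row is dropped if any cell contains one.
filter_list = ['#', '##', '$', '$$$']

def keep_row(*cells):
    """Return True when no cell contains any token from filter_list as a substring."""
    return all(token not in str(cell) for cell in cells for token in filter_list)

# Sample rows from the question, as (id, name, Salary) strings (assumed types).
sample_rows = [
    ('#', 'Yogita', '3000'),
    ('2', 'Bhavana', '5000'),
    ('$$', '###', '7000'),
    ('%$4#', 'Neha', '$$$$'),
]

kept = [row for row in sample_rows if keep_row(*row)]
print(kept)  # [('2', 'Bhavana', '5000')]
```

Note that this is a substring test, unlike `isin`, which matches whole values. That is why cells such as '$$' and '%$4#' are caught even though they are not in `filter_list`; it also means the single-character tokens '#' and '$' already cover the longer ones.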
Upvotes: 1