How to pass an array (multiple columns) in the code below using PySpark

Question

How can I pass a list of columns (multiple columns) instead of a single column in PySpark when using this command?

new_df = new_df.filter(new_df.color.isin(*filter_list) == False)

For example, I used this code to remove garbage values (#, $) from a single column:

filter_list = ['##', '$']

new_df = new_df.filter(new_df.color.isin(*filter_list) == False)

In this example, 'color' is the column.

But I want to remove rows with garbage values (#, ##, $, $$$), which can occur multiple times, across multiple columns.

Sample input:

id       name       Salary
#        Yogita     3000
2        Bhavana    5000
$$       ###        7000
%$4#     Neha       $$$$

Sample output:

id       name       Salary
2        Bhavana    5000

Can anybody help me?

Thanks in advance,

Yogita

ags29 · Accepted Answer

Here is an answer using a user-defined function:

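from functools import reduce  # reduce is not a builtin in Python 3
from itertools import chain

import pyspark.sql.functions as f
from pyspark.sql.types import BooleanType

filter_list = ['#', '##', '$', '$$$']

def filterfn(*x):
    # One boolean per (value, token) pair; assumes every column holds non-null strings.
    booleans = list(chain(*[[token not in elt for token in filter_list] for elt in x]))
    # Keep the row only if no token occurs in any of its values.
    return reduce(lambda a, b: a and b, booleans, True)

filter_udf = f.udf(filterfn, BooleanType())
# Pass every column of the row to the UDF.
new_df.filter(filter_udf(*[col for col in new_df.columns])).show(10)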
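To see what the predicate does without starting a Spark session, here is a minimal sketch that runs the same logic on the sample rows from the question as plain Python tuples. It assumes every value is a string; the `rows` and `kept` names are only for illustration and are not part of the original answer.

```python
from functools import reduce
from itertools import chain

# Same garbage tokens as in the answer.
filter_list = ['#', '##', '$', '$$$']

def filterfn(*x):
    # True only when no token occurs as a substring of any value in the row.
    booleans = list(chain(*[[token not in elt for token in filter_list] for elt in x]))
    return reduce(lambda a, b: a and b, booleans, True)

# Sample rows from the question, with every value kept as a string (an assumption).
rows = [
    ('#', 'Yogita', '3000'),
    ('2', 'Bhavana', '5000'),
    ('$$', '###', '7000'),
    ('%$4#', 'Neha', '$$$$'),
]

# Keep only the rows for which the predicate holds.
kept = [row for row in rows if filterfn(*row)]
print(kept)  # [('2', 'Bhavana', '5000')]
```

Note that the check is a substring test, not the exact match that isin performs, so a value such as '%$4#' is rejected even though it is not listed in filter_list.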
