Rocky1989

Reputation: 399

Pyspark filter dataframe dynamically

I want to dynamically filter out the rows with null values in the input dataframe, since the value columns can range from value1 ... valueN.

Input Dataframe:

|dim1|dim2|TR_LIST|value1|value2|
| 101| 201|MTD0001|     1|    21|
| 101| 201|MTD0001|  null|  null|
| 101| 201|MTD0001|  null|  null|
| 102| 202|MTD0002|  null|  null|
| 102| 202|MTD0002|     2|    22|
| 102| 202|MTD0002|  null|  null|
| 103| 203|MTD0003|  null|  null|
| 103| 203|MTD0003|  null|  null|
| 103| 203|MTD0003|     3|    23|

Output Dataframe:

|dim1|dim2|TR_LIST|value1|value2|
| 101| 201|MTD0001|     1|    21|
| 102| 202|MTD0002|     2|    22|
| 103| 203|MTD0003|     3|    23|

I have tried filtering with a for loop inside the dataframe's filter method, but I get an error saying it should be either str or Column.

The code I tried to use:

list_value = ['value1', 'value2']  # Here I will be passing the value columns as a list
df.filter(df[value].isNotNull() for value in list_value)  # Throws an error

I also tried another way:

df.where(" AND ".join(["df.%s.isNotNull()" % li for li in list_value])).show()  # Even here I'm getting an error
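For comparison, the join idea can produce a valid condition when the joined pieces are SQL fragments rather than Python expressions embedded in a string, since where also accepts a SQL expression string. A minimal sketch, assuming list_value holds the column names (the df.where call is commented out because df is not defined here):

```python
list_value = ["value1", "value2"]

# Join SQL fragments, not Python code inside a string;
# where() accepts a SQL expression string.
predicate = " AND ".join("%s IS NOT NULL" % c for c in list_value)

# df.where(predicate).show()  # df is assumed to be the input dataframe
print(predicate)  # value1 IS NOT NULL AND value2 IS NOT NULL
```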

Please point out any mistakes I might have made in the queries above.

Upvotes: 0

Views: 493

Answers (2)

Bruno Vilar

Reputation: 36

You can compose the filters by doing something like this:

from pyspark.sql import functions as sf

columns_to_check = ['value1', 'value2']

# Start with the predicate for the first column ...
dynamic_filter = sf.col(columns_to_check[0]).isNotNull()

# ... then AND in a predicate for each remaining column
for column in columns_to_check[1:]:
    dynamic_filter = dynamic_filter & sf.col(column).isNotNull()

df.filter(dynamic_filter)
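The same loop can also be written as a fold with functools.reduce. The sketch below uses plain Python rows so it runs without Spark; in PySpark you would fold sf.col(c).isNotNull() expressions with & instead. This stand-in only illustrates the pattern, not the Spark API:

```python
from functools import reduce

columns_to_check = ["value1", "value2"]

# One predicate per column; c=c pins the column name at definition time.
predicates = [lambda row, c=c: row[c] is not None for c in columns_to_check]

# Fold the predicates with "and", mirroring how & chains Column expressions.
def combined(row):
    return reduce(lambda acc, p: acc and p(row), predicates, True)

rows = [
    {"value1": 1, "value2": 21},
    {"value1": None, "value2": None},
]
kept = [r for r in rows if combined(r)]
# kept == [{"value1": 1, "value2": 21}]
```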

Upvotes: 0

linog

Reputation: 6226

I don't understand what you mean by dynamically in this example, but nulls can be removed using .isNotNull() (which you seem to know) or dropna.

The second approach seems more appropriate when there are several columns to inspect:

df = df.dropna(subset= ["value1", "value2"])

The default for the how argument is 'any', but you can also use 'all'. See the doc.

More generally, you can pass as many columns as needed thanks to the subset argument.
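To illustrate how='any' versus how='all' on the subset columns, here is a plain-Python stand-in for the two dropna behaviours (an illustration of the semantics only, not the Spark implementation):

```python
rows = [
    {"value1": 1, "value2": None},
    {"value1": None, "value2": None},
]
subset = ("value1", "value2")

# how='any' (default): drop a row if ANY subset column is null,
# i.e. keep only rows where every subset column is non-null.
keep_with_any = [r for r in rows if all(r[c] is not None for c in subset)]

# how='all': drop a row only if ALL subset columns are null.
keep_with_all = [r for r in rows if any(r[c] is not None for c in subset)]

# keep_with_any == []
# keep_with_all == [{"value1": 1, "value2": None}]
```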

Please let me know if I misunderstood your problem.

Upvotes: 1
