Reputation: 399
I want to filter out the null values in the input DataFrame dynamically, since the value columns can range from value(1) ... value(n).
Input Dataframe:
|dim1|dim2|TR_LIST|value1|value2|
| 101| 201|MTD0001| 1| 21|
| 101| 201|MTD0001| null| null|
| 101| 201|MTD0001| null| null|
| 102| 202|MTD0002| null| null|
| 102| 202|MTD0002| 2| 22|
| 102| 202|MTD0002| null| null|
| 103| 203|MTD0003| null| null|
| 103| 203|MTD0003| null| null|
| 103| 203|MTD0003| 3| 23|
Output Dataframe:
|dim1|dim2|TR_LIST|value1|value2|
| 101| 201|MTD0001| 1| 21|
| 102| 202|MTD0002| 2| 22|
| 103| 203|MTD0003| 3| 23|
I have tried filtering with a for loop inside the DataFrame's filter method, but I am getting an error like "condition should be either str or Column".
Code I tried:
list_value = [value1, value2]  # Here I will be passing the value columns as a list
df.filter(df.value.isNotNull() for value in list_value)  # Throws an error
I also tried another way:
df.where(" AND ".join(["df.%s.isNotNull()" % li for li in list_value])).show()  # This also raises an error
Please point out any mistakes I might have made in the above queries.
Upvotes: 0
Views: 493
Reputation: 36
You can compose the filters by doing something like this:
from pyspark.sql import functions as sf

columns_to_check = ['value1', 'value2']

# Start with the first column's condition and AND in the rest
dynamic_filter = sf.col(columns_to_check[0]).isNotNull()
for column in columns_to_check[1:]:
    dynamic_filter = dynamic_filter & sf.col(column).isNotNull()

df.filter(dynamic_filter)
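If you prefer to avoid the explicit loop, the same condition can be folded with functools.reduce. This is just an equivalent sketch assuming the same columns_to_check list and an existing DataFrame df:

from functools import reduce
from pyspark.sql import functions as sf

columns_to_check = ['value1', 'value2']

# Fold one isNotNull() condition per column into a single AND-ed expression
dynamic_filter = reduce(
    lambda acc, c: acc & sf.col(c).isNotNull(),
    columns_to_check[1:],
    sf.col(columns_to_check[0]).isNotNull(),
)
df.filter(dynamic_filter).show()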
Upvotes: 0
Reputation: 6226
I don't understand what you mean by "dynamically" in this example, but null values can be filtered out using .isNull() (which you seem to know already) or dropna().
The second approach seems more appropriate when there are several columns to inspect:
df = df.dropna(subset=["value1", "value2"])
The default for the how argument is "any", but you can also use "all". See the doc.
More generally, you can pass as many columns as needed through the subset argument.
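If the list of value columns is not known in advance, the subset can also be derived from df.columns. The "value" name prefix below is just an assumption about your schema:

# Collect every column whose name starts with "value" (assumed naming pattern)
value_cols = [c for c in df.columns if c.startswith("value")]

# how="any" (the default) drops a row if any listed column is null;
# how="all" only drops rows where all of them are null
df = df.dropna(how="any", subset=value_cols)
df.show()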
Please let me know if I misunderstood your problem.
Upvotes: 1