Reputation: 833
Assuming that I have the following Spark Dataframe:
+---+---+----+----+------+
| c1| c2|  c3|  c4|    c5|
+---+---+----+----+------+
|  A|abc| 0.1|null| 0.562|
|  B|def|0.15| 0.5| 0.123|
|  A|ghi| 0.2| 0.2|0.1345|
|  B|jkl|null| 0.1| 0.642|
|  B|mno| 0.1| 0.1|  null|
+---+---+----+----+------+
How can I check whether all of the values in the last three columns (c3, c4, c5) lie within the range [0, 1], ignoring any null values?
Upvotes: 0
Views: 3120
Reputation: 39950
The following should do the trick:
from functools import reduce
import pyspark.sql.functions as F
import warnings
# Keep only the rows where any of the last three columns falls outside [0, 1];
# a null value makes the predicate null, so those entries are ignored
test_df = df.where(reduce(lambda x, y: x | y, ((F.col(c) > 1) | (F.col(c) < 0) for c in df.columns[2:])))
if len(test_df.head(1)) > 0:
    test_df.show()
    warnings.warn('Some of the values in the dataframe were out of range')
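If you only need a per-column yes/no answer rather than the offending rows themselves, a count-based variant also works. This is a minimal sketch, assuming the same df as above and that the numeric columns are df.columns[2:] as in the question:
import pyspark.sql.functions as F
# Count out-of-range values per numeric column; nulls never satisfy the
# condition, so they are skipped, as the question requires.
violations = df.select([
    F.count(F.when((F.col(c) < 0) | (F.col(c) > 1), c)).alias(c)
    for c in df.columns[2:]
]).first()
# Every count is 0 exactly when all non-null values of that column lie in [0, 1].
all_in_range = all(violations[c] == 0 for c in df.columns[2:])
For the sample DataFrame in the question every non-null value already lies in [0, 1], so each count is 0 and all_in_range is True.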
Upvotes: 2