Reputation: 781
I have the following spark dataframe -
+----+----+
|col1|col2|
+----+----+
| a| 1|
| b|null|
| c| 3|
+----+----+
Is there a way in the Spark API to detect whether col2 contains, say, 3? Please note that the answer should be just one indicator value - yes/no - and not the set of records that have 3 in col2.
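For reference, a minimal sketch that reproduces this DataFrame (assuming a local SparkSession; col2 is inferred as a nullable numeric column):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the sample data shown above; None becomes null in col2
df = spark.createDataFrame(
    [('a', 1), ('b', None), ('c', 3)],
    ['col1', 'col2'],
)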
Upvotes: 6
Views: 18494
Reputation: 1780
A recommended PySpark way of finding whether a DataFrame contains a particular value is to use the pyspark.sql.Column.contains API. You can wrap the filtered result in bool() to get a True/False value (an empty result collects to an empty list, which is falsy).
For your example:
bool(df.filter(df.col2.contains(3)).collect())
# Output: True

bool(df.filter(df.col2.contains(100)).collect())
# Output: False
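Note that Column.contains does a substring match (the column is cast to string), so contains(3) would also report True for values like 13 or 30. If exact matching matters, filtering on equality works the same way:

# Exact match instead of substring containment
bool(df.filter(df.col2 == 3).collect())
# True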
Source : https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.Column.contains.html
Upvotes: 7
Reputation: 4333
You can use when as a conditional expression to label each row:

from pyspark.sql.functions import col, when

df.select(
    when(col("col2") == 3, 'yes')
    .otherwise('no')
    .alias('col3')
)
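As written this produces one yes/no per row rather than a single indicator. A sketch of one way to collapse it, aggregating with max (this relies on 'yes' sorting after 'no' alphabetically):

from pyspark.sql.functions import col, max as max_, when

df.select(
    when(col("col2") == 3, 'yes')
    .otherwise('no')
    .alias('col3')
).agg(max_('col3')).first()[0]
# 'yes'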
Upvotes: 0
Reputation: 215137
By counting the number of values in col2 that are equal to 3:
import pyspark.sql.functions as f
df.agg(f.expr('sum(case when col2 = 3 then 1 else 0 end)')).first()[0] > 0
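The same idea can be written with the DataFrame API instead of a SQL expression string; a sketch using count over a when (count skips the nulls produced when the condition is false):

import pyspark.sql.functions as f

df.agg(f.count(f.when(f.col('col2') == 3, True))).first()[0] > 0
# True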
Upvotes: 2