Anirban Chakraborty

Reputation: 781

How to find if a Spark column contains a certain value?

I have the following Spark DataFrame:

+----+----+
|col1|col2|
+----+----+
|   a|   1|
|   b|null|
|   c|   3|
+----+----+

Is there a way in the Spark API to detect whether col2 contains, say, 3? Please note that the answer should be a single indicator value (yes/no), not the set of records that have 3 in col2.

Upvotes: 6

Views: 18494

Answers (3)

Anirban Saha

Reputation: 1780

The recommended PySpark way to check whether a DataFrame column contains a particular value is the pyspark.sql.Column.contains API. Wrapping the filtered result in bool() turns it into a True/False answer. For your example:

bool(df.filter(df.col2.contains(3)).collect())
# Output: True

bool(df.filter(df.col2.contains(100)).collect())
# Output: False

Source: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.Column.contains.html
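
One caveat: contains performs a string/substring match, so contains(3) would also match a value like 13. If col2 may hold multi-digit values, a minimal sketch using exact equality instead (assuming the same df as above) is:

# Exact equality rather than substring matching; head(1) avoids
# collecting every matching row back to the driver.
bool(df.filter(df.col2 == 3).head(1))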

Upvotes: 7

tourist

Reputation: 4333

You can use when as a conditional expression:

from pyspark.sql.functions import col, when

df.select(
    when(col("col2") == '3', 'yes')
    .otherwise('no')
    .alias('col3')
)
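
As written, this produces a yes/no per row rather than the single indicator the question asks for. A minimal sketch that collapses it into one value (assuming the same df) could aggregate with max:

from pyspark.sql.functions import col, when
from pyspark.sql.functions import max as sql_max

# max over 1/0 flags is 1 exactly when at least one row matches.
found = df.agg(sql_max(when(col("col2") == 3, 1).otherwise(0))).first()[0] == 1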

Upvotes: 0

akuiper

Reputation: 215137

By counting the number of values in col2 that are equal to 3:

import pyspark.sql.functions as f

# Sum a 1/0 flag per row; the total is positive iff any row has col2 = 3.
df.agg(f.expr('sum(case when col2 = 3 then 1 else 0 end)')).first()[0] > 0
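
A minimal equivalent sketch using column functions directly instead of a SQL expression string (assuming the same df):

import pyspark.sql.functions as f

# Same aggregation, expressed without f.expr.
df.agg(f.sum(f.when(f.col("col2") == 3, 1).otherwise(0))).first()[0] > 0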

Upvotes: 2
