TiTo

Reputation: 865

PySpark equivalent to pandas .isin()

I have the following PySpark DataFrame:

data = [
    ('foo',),
    ('baz',),
    ('bar',),
    ('qux',)
]
df = spark.createDataFrame(data, ['group'])

Now I want to create a new column number that is 0 if group is in the list zeros = ['baz', 'qux'], 1 if it is in ones = ['foo'], and 2 otherwise. In pandas I'd use .isin(), but I don't understand how to do that in PySpark.
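
For context, here is a minimal sketch of what I mean by the pandas approach (the name pdf and the use of numpy.select are just for illustration):

import numpy as np
import pandas as pd

pdf = pd.DataFrame({'group': ['foo', 'baz', 'bar', 'qux']})
zeros = ['baz', 'qux']
ones = ['foo']

# .isin() gives boolean masks; np.select picks the value for the first matching mask
pdf['number'] = np.select(
    [pdf['group'].isin(zeros), pdf['group'].isin(ones)],
    [0, 1],
    default=2,
)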

Here is what I've tried, but it does not work:

df.withColumn("number", 
                func.when(func.col("group")  == array(*[lit(x) for x in ones])), 1)
               .otherwise(2))

Upvotes: 0

Views: 1676

Answers (1)

mck

Reputation: 42352

You can also use isin in PySpark. See the syntax below:

import pyspark.sql.functions as F

zeros = ['baz', 'qux']
ones = ['foo']

df2 = df.withColumn('number',
    F.when(F.col('group').isin(zeros), 0)
     .when(F.col('group').isin(ones), 1)
     .otherwise(2)
)

df2.show()
+-----+------+
|group|number|
+-----+------+
|  foo|     1|
|  baz|     0|
|  bar|     2|
|  qux|     0|
+-----+------+
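
If you prefer to keep the mapping as a SQL string, the same logic can also be written with F.expr and a CASE WHEN ... IN clause. This is just a sketch of an equivalent alternative, not required for the isin solution above; the column name is backticked because group is also a SQL keyword:

import pyspark.sql.functions as F

# Same mapping expressed as a Spark SQL CASE expression
df3 = df.withColumn(
    'number',
    F.expr("CASE WHEN `group` IN ('baz', 'qux') THEN 0 "
           "WHEN `group` IN ('foo') THEN 1 "
           "ELSE 2 END")
)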

Upvotes: 3
