Reputation: 865
I have the following PySpark DataFrame:
# Each row must be a one-element tuple; ('foo') without the trailing comma is just a string.
data = [
    ('foo',),
    ('baz',),
    ('bar',),
    ('qux',)
]
df = spark.createDataFrame(data, ["group"])
Now I want to create a new column number that is 0 if group is in the list zeros = ['baz', 'qux'], 1 if it is in ones = ['foo'], and 2 otherwise. In pandas I'd use .isin(), but I don't understand how to solve that in PySpark.
Here is what I've tried, but it does not work:
df.withColumn("number",
func.when(func.col("group") == array(*[lit(x) for x in ones])), 1)
.otherwise(2))
Upvotes: 0
Views: 1676
Reputation: 42352
You can use isin in PySpark as well. See the syntax below:
import pyspark.sql.functions as F
zeros = ['baz', 'qux']
ones = ['foo']
df2 = df.withColumn('number',
    F.when(F.col('group').isin(zeros), 0)
     .when(F.col('group').isin(ones), 1)
     .otherwise(2)
)
df2.show()
+-----+------+
|group|number|
+-----+------+
|  foo|     1|
|  baz|     0|
|  bar|     2|
|  qux|     0|
+-----+------+
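If more lists get added later, the chain of when calls can also be built in a loop instead of being written out by hand. The following is only a minimal sketch under a few assumptions: the rules live in a dict called rules, and number_for and with_number are hypothetical helper names, not part of PySpark. The plain-Python number_for mirrors the first-match behaviour of when/otherwise, so the rules can be checked without a Spark session.

```python
# Assumed layout: number -> list of group values that map to it.
rules = {0: ['baz', 'qux'], 1: ['foo']}


def number_for(group, rules, default):
    """Plain-Python reference: the first rule containing the group wins."""
    for number, groups in rules.items():
        if group in groups:
            return number
    return default


def with_number(df, rules, default):
    """Add a 'number' column using one when(...isin(...)) per rule."""
    # Imported here so number_for stays usable without a Spark installation.
    from pyspark.sql import functions as F

    expr = None
    for number, groups in rules.items():
        cond = F.col('group').isin(groups)
        # Start the chain with F.when, then extend it with Column.when.
        expr = F.when(cond, number) if expr is None else expr.when(cond, number)
    # With no rules at all, every row simply gets the default.
    column = F.lit(default) if expr is None else expr.otherwise(default)
    return df.withColumn('number', column)
```

With the df from the question, with_number(df, rules, 2) should build the same expression as the explicit when chain above.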
Upvotes: 3