How to update two columns with different values on the same condition in Pyspark?

I have a DataFrame with column a. I would like to create two additional columns (b and c) based on column a. I could solve this problem by doing the same thing twice:

df = df.withColumn('b', when(df.a == 'something', 'x'))\
       .withColumn('c', when(df.a == 'something', 'y'))

I would like to avoid doing the same thing twice, as the condition on which b and c are updated is the same, and there are also a lot of cases for column a. Is there a smarter solution to this problem? Could "withColumn" accept multiple columns, perhaps?

Upvotes: 4

Views: 2334

Answers (3)

samkart

Reputation: 6644

A struct is best suited in such a case. See the example below.

from pyspark.sql import functions as func

spark.sparkContext.parallelize([('something',), ('foobar',)]).toDF(['a']). \
    withColumn('b_c_struct',
               func.when(func.col('a') == 'something',
                         func.struct(func.lit('x').alias('b'), func.lit('y').alias('c'))
                         )
               ). \
    select('*', 'b_c_struct.*'). \
    show()

# +---------+----------+----+----+
# |        a|b_c_struct|   b|   c|
# +---------+----------+----+----+
# |something|    {x, y}|   x|   y|
# |   foobar|      null|null|null|
# +---------+----------+----+----+

Just use a drop('b_c_struct') after the select to remove the struct column and keep the individual fields.
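
For instance, a minimal sketch of the same pipeline with the drop added at the end (assuming an active spark session and pyspark.sql.functions imported as func, as above):

# same pipeline as above, but dropping the intermediate struct column
spark.sparkContext.parallelize([('something',), ('foobar',)]).toDF(['a']). \
    withColumn('b_c_struct',
               func.when(func.col('a') == 'something',
                         func.struct(func.lit('x').alias('b'), func.lit('y').alias('c')))
               ). \
    select('*', 'b_c_struct.*'). \
    drop('b_c_struct'). \
    show()

# +---------+----+----+
# |        a|   b|   c|
# +---------+----+----+
# |something|   x|   y|
# |   foobar|null|null|
# +---------+----+----+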

Upvotes: 2

Jonathan

Reputation: 2033

With withColumn, you can only create or modify one column at a time. You can achieve this by mapping over the RDD with a user-defined function, although it's not recommended:

temp = spark.createDataFrame(
    [(1, )],
    schema=['col']
)

temp.show(10, False)
+---+
|col|
+---+
|1  |
+---+

#You can create your own logic in your UDF
def user_defined_function(val, col_name):
    if col_name == 'col2':
        val += 1
    elif col_name == 'col3':
        val += 2
    else:
        pass

    return val

temp = temp.rdd.map(
    lambda row: (row[0],
                 user_defined_function(row[0], 'col2'),
                 user_defined_function(row[0], 'col3'))
).toDF(['col', 'col2', 'col3'])
temp.show(3, False)
+---+----+----+
|col|col2|col3|
+---+----+----+
|1  |2   |3   |
+---+----+----+

Upvotes: 1

If I understood correctly, you want to create multiple columns with a single withColumn call? If that is the case, you won't be able to do it.

Based on the official documentation, withColumn "Returns a new DataFrame by adding a column or replacing the existing column that has the same name."

You have the following options to add multiple columns:

1. Chain the withColumn calls:

df = df.withColumn('b', when(df.a == 'something', 'x'))\
       .withColumn('c', when(df.a == 'something', 'y'))

2. Use two separate statements:

df = df.withColumn('b', when(df.a == 'something', 'x'))
df = df.withColumn('c', when(df.a == 'something', 'y'))

Just to mention, you need to add an .otherwise(...) after the .when(...) to specify which value to use when the condition is not met; without it, non-matching rows get null.
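
For instance, a minimal sketch of how .otherwise would fit into the original snippet (the 'default_b' / 'default_c' fallback values here are just placeholders):

from pyspark.sql.functions import when

# .otherwise supplies the value for rows where the condition is not met;
# without it those rows would be null
df = df.withColumn('b', when(df.a == 'something', 'x').otherwise('default_b'))\
       .withColumn('c', when(df.a == 'something', 'y').otherwise('default_c'))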

Upvotes: 0
