Reputation: 105
I have a DataFrame with column a. I would like to create two additional columns (b and c) based on column a. I could solve this problem doing the same thing twice:
df = df.withColumn('b', when(df.a == 'something', 'x'))\
.withColumn('c', when(df.a == 'something', 'y'))
I would like to avoid doing the same thing twice, as the condition on which b and c are updated are the same, and also there are a lot of cases for column a. Is there a smarter solution to this problem? Could "withColumn" accept multiple columns perhaps?
Upvotes: 4
Views: 2334
Reputation: 6644
A struct
is best suited in such a case. See below example.
spark.sparkContext.parallelize([('something',), ('foobar',)]).toDF(['a']). \
withColumn('b_c_struct',
func.when(func.col('a') == 'something',
func.struct(func.lit('x').alias('b'), func.lit('y').alias('c'))
)
). \
select('*', 'b_c_struct.*'). \
show()
# +---------+----------+----+----+
# | a|b_c_struct| b| c|
# +---------+----------+----+----+
# |something| {x, y}| x| y|
# | foobar| null|null|null|
# +---------+----------+----+----+
Just use a drop('b_c_struct')
after the select
to remove the struct column and keep the individual fields.
Upvotes: 2
Reputation: 2033
By using withColumn
, you can only create or modify one column at each time. You can achieve by using rdd mapping with user defined functions, however it's not recommended:
temp = spark.createDataFrame(
[(1, )],
schema=['col']
)
temp.show(10, False)
+---+
|col|
+---+
|1 |
+---+
#You can create your own logic in your UDF
def user_defined_function(val, col_name):
if col_name == 'col2':
val += 1
elif col_name == 'col3':
val += 2
else:
pass
return val
temp = temp.rdd.map(lambda row: (row[0], user_defined_function(row[0], 'col2'), user_defined_function(row[0], 'col3'))).toDF(['col', 'col2', 'col3'])
temp.show(3, False)
+---+----+----+
|col|col2|col3|
+---+----+----+
|1 |2 |3 |
+---+----+----+
Upvotes: 1
Reputation: 1
If i correctly understood, you want to create multiple columns with a unique withColumn call ? If this is the case, you won't be able to do this.
Based on the official documentation, withColumn Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
You have the following options to add multiple columns:
1.
df = df.withColumn('b', when(df.a == 'something', 'x'))\
.withColumn('c', when(df.a == 'something', 'y'))
df = df.withColumn('b', when(df.a == 'something', 'x'))
df = df.withColumn('c', when(df.a == 'something', 'y'))
Just to mention, you need to add else .otherwise after .when to specify what value to replace when the condition is not met.
Upvotes: 0