renjith

Reputation: 121

Adding multiple columns in pyspark dataframe using a loop

I need to add a number of columns (4000) to a data frame in PySpark. I am using the withColumn function, but I am getting an assertion error.

df3 = df2.withColumn("['ftr' + str(i) for i in range(0, 4000)]", [expr('ftr[' + str(x) + ']') for x in range(0, 4000)])

Error

Not sure what is wrong.

Upvotes: 6

Views: 13667

Answers (2)

Chiel

Reputation: 2169

We can use .select() instead of .withColumn(), passing a list as input to get the same result as chaining multiple .withColumn() calls. The ["*"] is there to also select every existing column in the dataframe.

import pyspark.sql.functions as F

df2:

+---+
|age|
+---+
| 10|
| 11|
| 13|
+---+

df3 = df2.select(["*"] + [F.lit(f"{x}").alias(f"ftr{x}") for x in range(0,10)])

Results in:

+---+----+----+----+----+----+----+----+----+----+----+
|age|ftr0|ftr1|ftr2|ftr3|ftr4|ftr5|ftr6|ftr7|ftr8|ftr9|
+---+----+----+----+----+----+----+----+----+----+----+
| 10|   0|   1|   2|   3|   4|   5|   6|   7|   8|   9|
| 11|   0|   1|   2|   3|   4|   5|   6|   7|   8|   9|
| 13|   0|   1|   2|   3|   4|   5|   6|   7|   8|   9|
+---+----+----+----+----+----+----+----+----+----+----+
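To mirror the original question, where ftr appears to be an array column, the same .select() pattern can generate all the element columns at once. A minimal sketch, assuming df2 has an array column named ftr with at least 4000 elements:

import pyspark.sql.functions as F

# Build every projection up front and apply them in a single select,
# keeping all existing columns via ["*"].
df3 = df2.select(
    ["*"] + [F.col("ftr").getItem(x).alias(f"ftr{x}") for x in range(0, 4000)]
)

Because the whole plan is built in one select, this avoids the overhead of chaining thousands of separate transformations.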

Upvotes: 10

BICube

Reputation: 4681

Try to do something like this:

from pyspark.sql.functions import lit

df3 = df2
for i in range(0, 4000):
    df3 = df3.withColumn(f"ftr{i}", lit(f"ftr{i}"))
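If the goal is to pull values out of the ftr array as in the question, the same loop can use expr instead of a literal. A rough sketch, assuming ftr is an array column on df2:

from pyspark.sql.functions import expr

df3 = df2
for i in range(0, 4000):
    # Each iteration adds one column extracted from the array.
    df3 = df3.withColumn(f"ftr{i}", expr(f"ftr[{i}]"))

Note that every withColumn call adds another projection to the query plan, so for thousands of columns the single .select() shown in the other answer usually builds the plan noticeably faster.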

Upvotes: 2
