srinin

Reputation: 83

In Pyspark, how to add a list of values as a new column to an existing Dataframe?

I have a PySpark DataFrame like this:

+--------+----+
|    col1|col2|
+--------+----+
|   Apple|   A|
|  Google|   G|
|Facebook|   F|
+--------+----+

I have an array with the values ["SFO", "LA", "NYC"]. I want to add this array to the DataFrame as a new column, like this:

#+--------+----+----+
#|    col1|col2|col3|
#+--------+----+----+
#|   Apple|   A| SFO|
#|  Google|   G|  LA|
#|Facebook|   F| NYC|
#+--------+----+----+

How can I do that in PySpark? I am not supposed to use Pandas in my solution.

Upvotes: 1

Views: 2419

Answers (1)

murtihash

Reputation: 8410

You can use the array function, star-expanding (*) your list of lit literals into it, to put the whole list in every row as a new column. Then use a row_number() calculation to pass each row's position to element_at. (Spark 2.4+)

from pyspark.sql import functions as F
from pyspark.sql.window import Window
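
# recreate the question's DataFrame so the snippet is self-contained
# (assumes an existing SparkSession named `spark`)
df = spark.createDataFrame(
    [("Apple", "A"), ("Google", "G"), ("Facebook", "F")],
    ["col1", "col2"],
)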

# window over the whole DataFrame; "col3" is resolved lazily, after
# withColumn adds it below, and since it is the same array in every row
# the ordering is arbitrary but still yields a complete 1..n numbering
w = Window.orderBy("col3")
arr = ["SFO", "LA", "NYC"]

# put the full list in every row, number the rows,
# then keep only the rownum-th element of the array
df.withColumn("col3", F.array(*[F.lit(x) for x in arr]))\
  .withColumn("rownum", F.row_number().over(w))\
  .withColumn("col3", F.expr("element_at(col3, rownum)"))\
  .drop("rownum").show()

#+--------+----+----+
#|    col1|col2|col3|
#+--------+----+----+
#|   Apple|   A| SFO|
#|  Google|   G|  LA|
#|Facebook|   F| NYC|
#+--------+----+----+
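
Note that row_number() over a window with no partitionBy moves all rows to a single partition, and ordering by a constant means Spark does not guarantee which list element lands on which row. If the pairing must follow the DataFrame's current order deterministically, a minimal alternative sketch (again assuming a SparkSession named `spark`, plus the `df` and `arr` from above) is to index both sides and join:

# index each row of the original DataFrame on the RDD side
indexed = df.rdd.zipWithIndex().map(lambda pair: (pair[1], *pair[0]))
df_idx = indexed.toDF(["idx", "col1", "col2"])

# index the list the same way and join on the shared index
arr_df = spark.createDataFrame(list(enumerate(arr)), ["idx", "col3"])
df_idx.join(arr_df, "idx").drop("idx").show()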

Upvotes: 2
