Reputation: 83
I have a PySpark DataFrame like this:
+--------+----+
| col1|col2|
+--------+----+
| Apple| A|
| Google| G|
|Facebook| F|
+--------+----+
I have a list with the values ["SFO", "LA", "NYC"]. I want to add it to the DataFrame as a new column, pairing the values with the rows in order, like this:
#+--------+----+----+
#|    col1|col2|col3|
#+--------+----+----+
#|   Apple|   A| SFO|
#|  Google|   G|  LA|
#|Facebook|   F| NYC|
#+--------+----+----+
How can I do that in PySpark? I am not supposed to use pandas in my solution.
Upvotes: 1
Views: 2419
Reputation: 8410
You can use the array function, star-expanding (*) your list into it with lit, to put the whole list on every row as a new column. Then compute a row_number() over a window and pass it to element_at to pick out the element matching each row (Spark 2.4+).
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Ordering by the constant array column effectively keeps the incoming
# row order; Spark does not guarantee that order, so see the caveat below.
w = Window().orderBy("col3")
arr = ["SFO", "LA", "NYC"]

df.withColumn("col3", F.array(*[F.lit(x) for x in arr]))\
  .withColumn("rownum", F.row_number().over(w))\
  .withColumn("col3", F.expr("element_at(col3, rownum)"))\
  .drop("rownum").show()
#+--------+----+----+
#| col1|col2|col3|
#+--------+----+----+
#| Apple| A| SFO|
#| Google| G| LA|
#|Facebook| F| NYC|
#+--------+----+----+
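One caveat: row_number() over a window ordered by a constant column relies on Spark happening to preserve the original row order, which is not guaranteed across partitions. If that worries you, here is a sketch of a more defensive variant of the same idea, numbering the rows by monotonically_increasing_id (which does increase with the incoming row order) instead:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

arr = ["SFO", "LA", "NYC"]

# Tag each row with an id that grows with the original row order,
# then number the rows by that id rather than by a constant column.
w = Window().orderBy("id")
df.withColumn("id", F.monotonically_increasing_id())\
  .withColumn("col3", F.array(*[F.lit(x) for x in arr]))\
  .withColumn("rownum", F.row_number().over(w))\
  .withColumn("col3", F.expr("element_at(col3, rownum)"))\
  .drop("id", "rownum").show()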
Upvotes: 2