bingitybang
bingitybang

Reputation: 51

PySpark Mapping Elements in Array within a Dataframe to another Dataframe

I have two dataframes. The first dataframe has an array as its value for column2 and I want to join it with the second dataframe so the numerical values are mapped to their string values. The order of the elements should stay the same since they correspond to the array elements in column3 by index.

df_one:

 column1|  column2|        column3
----------------------------------
"thing1"|[1,2,3..]|[0.1,0.2,0.3..]
"thing2"|[1,2,3..]|[0.1,0.2,0.3..]
"thing3"|[1,2,3..]|[0.1,0.2,0.3..]
...

df_two:

columnA|columnB
---------------
      1|"item1"
      2|"item2"
      3|"item3"
...

Is there a way to join these dataframes and select the columns like so:

column1 |                  newColumn|        column3
----------------------------------------------------
"thing1"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
"thing2"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
"thing3"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
...

Upvotes: 4

Views: 962

Answers (1)

Oli
Oli

Reputation: 10406

As mentioned in the comments, explode on column2 and then join on columnA is the good way to go. Yet, when you group back the data, I am not sure that the order would always be preserved.

To be sure, and avoid costly UDF in python, you could use posexplode to keep track of the position of each element, and then an ordered window function to build the list back:

df_one = spark.createDataFrame([("thing1", [1, 2, 3], "X"), ("thing2", [1, 2, 3], "Y"), ("thing3", [1, 2, 3], "Z")],
                               ["column1", "column2", "column3"])
df_two = spark.createDataFrame([(1, "item1"), (2, "item2"), (3, "item3")],
                               ["columnA", "columnB"])

w = Window.partitionBy("column1").orderBy("pos")

df_one\
    .select("*", f.posexplode("column2").alias("pos", "columnA"))\
    .join(df_two, ['columnA'])\
    .withColumn("newColumn", f.collect_list("columnB").over(w))\
    .where(f.col("pos")+1 == f.size(f.col("column2")))\
    .select("column1", "newColumn", "column3")\
    .show(truncate=False)
+-------+---------------------+-------+
|column1|newColumn            |column3|
+-------+---------------------+-------+
|thing1 |[item1, item2, item3]|X      |
|thing2 |[item1, item2, item3]|Y      |
|thing3 |[item1, item2, item3]|Z      |
+-------+---------------------+-------+

Upvotes: 2

Related Questions