Reputation: 51
I have two dataframes. The first dataframe has an array as its value for column2
and I want to join it with the second dataframe so the numerical values are mapped to their string values. The order of the elements should stay the same since they correspond to the array elements in column3
by index.
df_one
:
column1| column2| column3
----------------------------------
"thing1"|[1,2,3..]|[0.1,0.2,0.3..]
"thing2"|[1,2,3..]|[0.1,0.2,0.3..]
"thing3"|[1,2,3..]|[0.1,0.2,0.3..]
...
df_two
:
columnA|columnB
---------------
1|"item1"
2|"item2"
3|"item3"
...
Is there a way to join these dataframes and select the columns like so:
column1 | newColumn| column3
----------------------------------------------------
"thing1"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
"thing2"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
"thing3"|["item1","item2","item3"..]|[0.1,0.2,0.3..]
...
Upvotes: 4
Views: 962
Reputation: 10406
As mentioned in the comments, explode
on column2
and then join
on columnA
is the good way to go. Yet, when you group back the data, I am not sure that the order would always be preserved.
To be sure, and avoid costly UDF in python, you could use posexplode
to keep track of the position of each element, and then an ordered window function to build the list back:
df_one = spark.createDataFrame([("thing1", [1, 2, 3], "X"), ("thing2", [1, 2, 3], "Y"), ("thing3", [1, 2, 3], "Z")],
["column1", "column2", "column3"])
df_two = spark.createDataFrame([(1, "item1"), (2, "item2"), (3, "item3")],
["columnA", "columnB"])
w = Window.partitionBy("column1").orderBy("pos")
df_one\
.select("*", f.posexplode("column2").alias("pos", "columnA"))\
.join(df_two, ['columnA'])\
.withColumn("newColumn", f.collect_list("columnB").over(w))\
.where(f.col("pos")+1 == f.size(f.col("column2")))\
.select("column1", "newColumn", "column3")\
.show(truncate=False)
+-------+---------------------+-------+
|column1|newColumn |column3|
+-------+---------------------+-------+
|thing1 |[item1, item2, item3]|X |
|thing2 |[item1, item2, item3]|Y |
|thing3 |[item1, item2, item3]|Z |
+-------+---------------------+-------+
Upvotes: 2