I have a DataFrame with two columns, of types Int and Array[Int], containing the following values:
+---+------+
| _1|    _2|
+---+------+
|  1|   [1]|
|  1|   [2]|
|  2|[3, 4]|
+---+------+
I want to get back a DataFrame with a third column that holds the hash of each array element:
+---+------+------------------+
| _1|    _2|                _3|
+---+------+------------------+
|  1|   [1]|         [hash(1)]|
|  1|   [2]|         [hash(2)]|
|  2|[3, 4]|[hash(3), hash(4)]|
+---+------+------------------+
I originally tried converting the DataFrame to a Dataset and mapping over it, but I could not reproduce Spark's hash values with MurmurHash3. In short, I am unable to reproduce the behavior of https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2165-L2168.
Any ideas on how to proceed?
I am open to any method to get my desired result.
Upvotes: 0
Use transform:
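import spark.implicits._  // needed for toDF outside spark-shell; assumes a SparkSession named spark
val df = Seq((1, Seq(1)), (1, Seq(2)), (2, Seq(3, 4))).toDF
df.selectExpr("*", "transform(_2, x -> hash(x)) AS _3").show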
+---+------+--------------------+
| _1|    _2|                  _3|
+---+------+--------------------+
|  1|   [1]|        [-559580957]|
|  1|   [2]|        [1765031574]|
|  2|[3, 4]|[-1823081949, -39...|
+---+------+--------------------+
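If you would still rather go the Dataset/map route from the question, here is a minimal sketch of how the same numbers might be reproduced in plain Scala. It assumes that Spark's hash on a single Int column is Murmur3 (x86, 32-bit) with seed 42 applied to one 4-byte block. The object and method names (SparkLikeHash, sparkLikeHash) are made up for illustration and are not part of Spark. Only scala.util.hashing.MurmurHash3 from the standard library is used, so treat it as a cross-check rather than a replacement for the built-in function.

```scala
import scala.util.hashing.MurmurHash3

object SparkLikeHash {
  // Assumption: Spark's hash() uses seed 42 and, for an Int, mixes a single
  // 4-byte block and then finalizes with length 4.
  def sparkLikeHash(x: Int, seed: Int = 42): Int =
    MurmurHash3.finalizeHash(MurmurHash3.mix(seed, x), 4)

  def main(args: Array[String]): Unit = {
    // Same rows as the DataFrame in the question, as a plain Scala collection.
    val rows = Seq((1, Seq(1)), (1, Seq(2)), (2, Seq(3, 4)))

    // Append a third field holding the hash of every array element.
    val hashed = rows.map { case (k, xs) => (k, xs, xs.map(sparkLikeHash(_))) }
    hashed.foreach(println)
  }
}
```

Under that assumption, sparkLikeHash(1) comes out as -559580957, the same value as the first row of the _3 column above. The same function could be called inside a Dataset map, but transform with the built-in hash stays the simpler option because it does not depend on Spark's internal seed.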
Upvotes: 4