AJY

Reputation: 188

Hash the contents of a column of type Array[Int] individually

I have a DataFrame with an Int column and an Array[Int] column, containing these values:

+---+------+
| _1|    _2|
+---+------+
|  1|   [1]|
|  1|   [2]|
|  2|[3, 4]|
+---+------+

I want to return the following DataFrame, where each array element is hashed individually:

+---+------+------------------+
| _1|    _2|                _3|
+---+------+------------------+
|  1|   [1]|         [hash(1)]|
|  1|   [2]|         [hash(2)]|
|  2|[3, 4]|[hash(3), hash(4)]|
+---+------+------------------+

I originally tried converting the DataFrame to a Dataset and mapping over it, but I could not reproduce Spark's hash values with MurmurHash3. In short, I am unable to reproduce the behavior of https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2165-L2168.

Any ideas on how to proceed?

I am open to any method to get my desired result.

Upvotes: 0

Views: 41

Answers (1)

user10957899


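Use transform, a Spark SQL higher-order function (available since Spark 2.4) that applies a lambda to each element of an array:

// toDF needs the session implicits: automatic in spark-shell, otherwise import spark.implicits._
val df = Seq((1, Seq(1)), (1, Seq(2)), (2, Seq(3, 4))).toDF

// Hash every element of the array column _2 and keep the results as a new array column _3
df.selectExpr("*", "transform(_2, x -> hash(x)) AS _3").show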
+---+------+--------------------+
| _1|    _2|                  _3|
+---+------+--------------------+
|  1|   [1]|        [-559580957]|
|  1|   [2]|        [1765031574]|
|  2|[3, 4]|[-1823081949, -39...|
+---+------+--------------------+
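
For the MurmurHash3 part of the question, here is a minimal sketch of how the integer hashes above might be reproduced in plain Scala, without Spark. It assumes that Spark's hash applies 32-bit Murmur3 with seed 42 to each Int, treated as a 4-byte value. The object name SparkLikeHash and its hashInt method are hypothetical names for illustration, not part of Spark's API.

```scala
// Hypothetical helper, not part of Spark: a plain-Scala sketch of 32-bit
// Murmur3 over a single Int, assuming Spark's hash() uses seed 42.
object SparkLikeHash {
  private val C1 = 0xcc9e2d51
  private val C2 = 0x1b873593

  def hashInt(input: Int, seed: Int = 42): Int = {
    // Mix the input word.
    var k1 = input * C1
    k1 = Integer.rotateLeft(k1, 15)
    k1 *= C2

    // Fold the mixed word into the seed.
    var h1 = seed ^ k1
    h1 = Integer.rotateLeft(h1, 13)
    h1 = h1 * 5 + 0xe6546b64

    // Finalize, using the input length in bytes (4 for an Int).
    h1 ^= 4
    h1 ^= h1 >>> 16
    h1 *= 0x85ebca6b
    h1 ^= h1 >>> 13
    h1 *= 0xc2b2ae35
    h1 ^= h1 >>> 16
    h1
  }

  def main(args: Array[String]): Unit = {
    // Hash each element of an array individually, as in column _3.
    println(Seq(3, 4).map(hashInt(_)))
    println(hashInt(1)) // -559580957, the value shown for [1] above
  }
}
```

If the seed assumption holds, mapping hashInt over each array in a Dataset should give the same values as the transform expression. It is worth checking a few rows against Spark's own output before relying on it.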

Upvotes: 4
