Reputation: 868
I need to add a column to a DataFrame that is a hash of each row. The goal is to use this hash to uniquely identify the row. I will have upwards of 100,000,000 rows, which is why the hash needs to be so large. I am aware of the built-in Spark hash function, but unfortunately it is only 32 bits and would result in a very large number of hash collisions. How can I achieve this?
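As a back-of-the-envelope check (my own estimate, using the standard birthday approximation), the expected number of colliding pairs among n rows with a b-bit hash is roughly n^2 / 2^(b+1):

\mathbb{E}[\text{collisions}] \approx \frac{n^2}{2^{b+1}}, \qquad \frac{(10^8)^2}{2^{33}} \approx 1.2 \times 10^6 \ (\text{32-bit}), \qquad \frac{(10^8)^2}{2^{129}} \approx 1.5 \times 10^{-23} \ (\text{128-bit})

So a 32-bit hash over 100,000,000 rows is expected to produce on the order of a million collisions, while a 128-bit hash effectively never collides at this scale.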
Upvotes: 0
Views: 1898
Reputation: 3071
You could use the built-in md5
function, since it produces a 128-bit digest. It doesn't accept multiple arguments, though, so you have to concat
the values together first. You also need to handle the different datatypes and null values.
import org.apache.spark.sql.functions.{col, md5, concat, coalesce, lit}

// Cast every column to string, replace nulls with "", concatenate, and md5 the result.
val tab_w_hash = tab.withColumn("hash128", md5(concat(tab.columns.map(x => coalesce(col(x).cast("string"), lit(""))) : _*)))
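A small variation on the same idea, in case you want more than 128 bits or want to avoid accidental collisions from delimiter-less concatenation (e.g. ("ab","c") vs ("a","bc")): concat_ws inserts a separator between columns, and sha2 gives a 256-bit digest. The DataFrame name tab, the column name hash256, and the "|" delimiter are just illustrative assumptions; this is a sketch, not the only way to do it.

import org.apache.spark.sql.functions.{col, sha2, concat_ws, coalesce, lit}

// Cast each column to string, replace nulls with "", join with a "|" delimiter,
// then take a 256-bit SHA-2 digest of the joined string.
val tab_w_hash = tab.withColumn(
  "hash256",
  sha2(concat_ws("|", tab.columns.map(x => coalesce(col(x).cast("string"), lit(""))) : _*), 256)
)

Note that the delimiter only helps if it cannot appear inside the column values themselves; if it can, pick a separator (or an escaping scheme) that keeps row encodings unambiguous.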
Upvotes: 1