Zachary Oldham

Reputation: 868

Create 128-bit hash of Spark row, store as new column

I need to add a column to a DataFrame that is a hash of each row. The goal is to use this hash to uniquely identify each row. I will have upwards of 100,000,000 rows, which is why the hash needs to be so large. I am aware of the built-in Spark hash function, but unfortunately it is only 32 bits and would result in a very large number of hash collisions. How can I achieve this?
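For context, the 32-bit function referred to here is org.apache.spark.sql.functions.hash (Murmur3), and the collision worry is well founded: by the birthday bound, hashing n = 10^8 rows into 2^32 buckets yields roughly n^2 / (2 * 2^32), i.e. about a million, expected colliding pairs. A minimal sketch, assuming tab is the DataFrame in question:

import org.apache.spark.sql.functions.{col, hash}

// hash() is Spark's built-in Murmur3 hash over all columns and returns a
// 32-bit Int column, hence the collision risk at ~100M rows.
val tab_w_hash32 = tab.withColumn("hash32", hash(tab.columns.map(col): _*))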

Upvotes: 0

Views: 1898

Answers (1)

swdev

Reputation: 3071

You could use the built-in md5 function, since it produces a 128-bit hash. However, it only accepts a single argument, so you have to concatenate the column values first. You also need to cast non-string data types to string and handle null values (concat returns null if any of its inputs is null).

import org.apache.spark.sql.functions.{col, md5, concat, coalesce, lit}

// Cast each column to string, replace nulls with "", concatenate, then hash.
val tab_w_hash = tab.withColumn("hash128",
  md5(concat(tab.columns.map(x => coalesce(col(x).cast("string"), lit(""))): _*)))
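One caveat with the plain concat approach: because the values are joined with no separator, rows like ("ab", "c") and ("a", "bc") produce the same input string and therefore the same hash, and coalescing nulls to "" makes a null indistinguishable from a genuine empty string. A hedged variant using concat_ws with an explicit separator and a null sentinel (the "||" separator and "\u0000" sentinel are arbitrary choices, not from the original answer) avoids both problems:

import org.apache.spark.sql.functions.{col, md5, concat_ws, coalesce, lit}

// Coalescing before concat_ws keeps every column position present, so a null
// in one column cannot collide with a null in another; the separator prevents
// boundary collisions such as ("ab", "c") vs ("a", "bc").
val tab_w_hash = tab.withColumn("hash128",
  md5(concat_ws("||", tab.columns.map(x => coalesce(col(x).cast("string"), lit("\u0000"))): _*)))

If 128 bits ever feels tight, sha2(..., 256) is a drop-in replacement for md5 here.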

Upvotes: 1
