CI L'OC

Reputation: 25

What hash algorithm is used in pyspark.sql.functions.hash?

I have a simple question about the PySpark hash function.

I have checked that in Scala, Spark uses Murmur3Hash, based on Hash function in spark.

I want to know exactly which algorithm is used by the hash function in PySpark (https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html#hash).

Could anyone answer this question? I would also like to see the code that shows which algorithm the PySpark hash function uses.

Upvotes: 1

Views: 3330

Answers (2)

utkarshgupta137

Reputation: 191

Please note that reproducing the hash values outside PySpark is not trivial, at least in Python. PySpark uses its own implementation of this algorithm, which doesn't produce the same results as the Murmur3 libraries available in Python.

Even Scala & PySpark's hash algorithms aren't directly compatible. The reason for this is explained in https://stackoverflow.com/a/46472986/10999642

So if reproducibility in python is important, you can use python's in-built hash function, like so:

from pyspark.sql import functions as F, types as T

# Python's built-in hash() can be reproduced in plain Python outside Spark
udf_hash = F.udf(lambda val: hash(val), T.LongType())
df = df.withColumn("hash", udf_hash("<column name>"))
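One caveat worth noting with the built-in `hash()` approach: since Python 3.3, string hashes are randomized per interpreter process, so `hash()` is only reproducible across runs (and across Spark executors) if `PYTHONHASHSEED` is pinned to a fixed value. A small sketch outside Spark that demonstrates this:

```python
import os
import subprocess
import sys

# Python randomizes str hashes per process (PEP 456), so two interpreter
# runs disagree unless PYTHONHASHSEED is pinned to a fixed value.
def hash_in_subprocess(seed: str) -> str:
    env = {**os.environ, "PYTHONHASHSEED": seed}
    out = subprocess.run(
        [sys.executable, "-c", 'print(hash("spark"))'],
        env=env, capture_output=True, text=True,
    )
    return out.stdout.strip()

# With a pinned seed, every run returns the same value.
pinned = {hash_in_subprocess("0") for _ in range(3)}
print(len(pinned))  # 1
```

In a Spark job this means setting `PYTHONHASHSEED` in the executor environment (e.g. via `spark.executorEnv.PYTHONHASHSEED`), otherwise different executors can hash the same string to different values.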

Upvotes: 2

mck

Reputation: 42402

PySpark is just a wrapper around the Scala Spark code, and I believe it uses the same hash function as Scala Spark.

In your link to the source code, you can see that it calls sc._jvm.functions.hash, which essentially points to the equivalent function in the Scala source code (inside the "JVM").
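For a sense of what that Scala-side algorithm looks like: Spark's `Murmur3Hash` expression is based on 32-bit Murmur3, seeded with a fixed value (42 in the sources I've seen). Below is a minimal plain-Python sketch of the standard Murmur3 x86 32-bit routine over raw bytes, for illustration only. Note that Spark serializes each column type to bytes in its own way before hashing, so this will not reproduce `pyspark.sql.functions.hash` output byte-for-byte.

```python
def murmur3_32(data: bytes, seed: int = 42) -> int:
    """Reference 32-bit Murmur3 (x86 variant) over raw bytes."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data) // 4 * 4

    # Body: mix one 32-bit little-endian block at a time.
    for i in range(0, n, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: up to 3 leftover bytes.
    tail, k = data[n:], 0
    if len(tail) == 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization: avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

print(hex(murmur3_32(b"", seed=1)))  # 0x514e28b7, a standard Murmur3 test vector
```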

Upvotes: 1
