Reputation: 25
I have a simple question about the PySpark hash function.
I have checked that in Scala, Spark uses Murmur3Hash, based on Hash function in spark.
I want to know exactly which algorithm is used by the hash function in PySpark (https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html#hash).
Could anyone answer this? I would also like to see the code that shows which algorithm the PySpark hash function uses.
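For concreteness, this is the function I mean (a minimal sketch; the DataFrame and column names are just placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("abc",)], ["word"])
# pyspark.sql.functions.hash returns a 32-bit signed integer per row
df.select(F.hash("word").alias("hash")).show()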
Upvotes: 1
Views: 3330
Reputation: 191
Please note that reproducing the hash values outside PySpark is not trivial, at least in Python. PySpark uses its own implementation of this algorithm, and standalone Murmur3 libraries run in Python do not give the same results.
Even Scala's and PySpark's hash algorithms aren't directly compatible; the reason for this is explained in https://stackoverflow.com/a/46472986/10999642
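You can see the mismatch for yourself by comparing Spark's output against a standalone Python Murmur3 library such as mmh3 (a rough sketch; it assumes mmh3 is installed, and 42 is the seed Spark is reported to use, per the linked answer):

import mmh3
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Spark's value, computed on the JVM side
spark_val = spark.range(1).select(F.hash(F.lit("abc"))).first()[0]
# mmh3's value for the same string with the same seed
py_val = mmh3.hash("abc", 42)
print(spark_val, py_val)  # these generally differ, e.g. for string inputs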
So if reproducibility within Python is important, you can use Python's built-in hash function, like so:

from pyspark.sql import functions as F, types as T

udf_hash = F.udf(lambda val: hash(val), T.LongType())
df = df.withColumn("hash", udf_hash("<column name>"))
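One caveat with this approach: in Python 3, hash() of strings is salted per interpreter process, so to get values that are stable across runs you need to pin PYTHONHASHSEED on the driver and on the executors (e.g. via spark.executorEnv.PYTHONHASHSEED).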
Upvotes: 2
Reputation: 42402
PySpark is just a wrapper around the Scala Spark code, so I believe it uses the same hash function as Scala Spark.
In your link to the source code, you can see that it calls sc._jvm.functions.hash, which essentially points to the equivalent function in the Scala source code (inside the JVM).
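The relevant wrapper in pyspark/sql/functions.py looks roughly like this (paraphrased from the linked source, not copied verbatim):

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column, _to_seq

def hash(*cols):
    sc = SparkContext._active_spark_context
    # Delegate to org.apache.spark.sql.functions.hash on the JVM side
    jc = sc._jvm.functions.hash(_to_seq(sc, cols, _to_java_column))
    return Column(jc)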
Upvotes: 1