Reputation: 247
I have a table with N columns, and I want to concatenate them all into a string column and then perform a hash on that column. I have found a similar question answered in Scala.
Ideally I want to do this entirely inside Spark SQL. I have tried HASH(*) AS myhashcolumn, but because several of the columns are sometimes null, I can't make this work as I would expect.
If I have to create and register a UDF to make this happen, I need to use Python rather than Scala, as all my other code is in Python.
Any ideas?
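For reference, this is the shape of what I tried (the table name here is just a placeholder):

spark.sql("SELECT HASH(*) AS myhashcolumn FROM my_table")

It runs, but rows that differ only in which columns are null do not always hash the way I expect.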
Upvotes: 4
Views: 8636
Reputation: 76
If you want to generate a hash based on all the columns of a DataFrame dynamically, you can use this:
import pyspark.sql.functions as F

# 64-bit hash computed over every column in the DataFrame
df.withColumn("checksum", F.xxhash64(*df.schema.names))
Explanation: df.schema.names is a list with the names of all the columns in the DataFrame df. Using a * spreads this list into the elements it contains. You can then pass the elements to functions such as xxhash64 (for 64-bit hashes) and hash (for 32-bit hashes).
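If nulls are what broke HASH(*) for you, one option is to coalesce each column to a sentinel string before hashing. This is a minimal sketch; the "<NULL>" sentinel is an arbitrary choice and assumes no real column value ever equals it:

import pyspark.sql.functions as F

# Cast each column to string and replace nulls with a sentinel,
# so rows that differ only in their null columns hash differently
null_safe = [F.coalesce(F.col(c).cast("string"), F.lit("<NULL>"))
             for c in df.schema.names]
df.withColumn("checksum", F.xxhash64(*null_safe))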
Upvotes: 6
Reputation: 3110
Try the code below.

import pyspark.sql.functions as F

df.select([F.hash(c) for c in df.columns]).show()
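Note that this gives you one hash column per input column. If you want a single row-level hash instead, you can spread the column list into one call (a small variation on the same idea):

import pyspark.sql.functions as F

# One 32-bit hash computed over all columns of each row
df.select(F.hash(*df.columns).alias("row_hash")).show()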
Upvotes: 4
Reputation: 18838
You can do it in PySpark like the following (just pass the input columns to the function):

from pyspark.sql.functions import col, hash

new_df = df.withColumn("concatenated", hash(col("col1"), col("col2"), col("col3")))
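Since the question asked to stay inside Spark SQL, the same built-in hash function is also available there, so no Python UDF registration is needed. A sketch, assuming the DataFrame is exposed as a temp view (the view and column names are placeholders):

# Register the DataFrame as a SQL view, then hash columns in plain SQL
df.createOrReplaceTempView("my_table")
spark.sql("SELECT hash(col1, col2, col3) AS hashed FROM my_table").show()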
Upvotes: 3