Akki

Reputation: 37

Generate a hash key (unique identifier column) in a Spark DataFrame

I have a table with more than 100k rows. I need to generate a unique ID from the concatenated columns. I tried the md5 function, which works fine for small datasets, but on larger data the values get duplicated. Can you please suggest an alternative, or a way to make this work reliably for 100k+ rows?

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("Veg", "tomato", 1.99),
  ("Veg", "potato", 0.45),
  ("Fruit", "apple", 0.99),
  ("Fruit", "pineapple", 2.59)
).toDF("Type", "Item", "Price")

df.withColumn("hash_value", md5(concat($"Type", $"Item", $"Price"))).show(false)

+-----+---------+-----+--------------------------------+
|Type |Item     |Price|hash_value                      |
+-----+---------+-----+--------------------------------+
|Veg  |tomato   |1.99 |82215bc9c2078d2f1e773ad62b4f88c6|
|Veg  |potato   |0.45 |5c68bcadcbfdedf8b8c6edca20fd5126|
|Fruit|apple    |0.99 |830a70f1c16f015aa54ca24d9ea6ce0b|
|Fruit|pineapple|2.59 |1f0974817391905d41224e76735fc5d4|
+-----+---------+-----+--------------------------------+

How can I create a unique identifier from the concatenated columns?

Upvotes: 2

Views: 13197

Answers (3)

Roozbeh

Reputation: 732

If you want to generate a hash key and at the same time handle columns containing null values, use concat_ws (which skips nulls instead of turning the whole concatenation null):

import pyspark.sql.functions as F

# concat_ws skips null values, so one null column does not null out the hash input
df = df.withColumn(
    "ID",
    F.sha2(
        F.concat_ws("", *(F.col(c).cast("string") for c in df.columns)),
        256
    )
)

Upvotes: 1

Stanley Cruvinel

Reputation: 31

Syntax in Scala. If necessary, normalize the strings with trim() and upper() before hashing:

...
sha2(upper(trim(concat(...))), 256)
...

The solution in Scala will look like this:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("Veg", "tomato", 1.99),
  ("Veg", "potato", 0.45),
  ("Fruit", "apple", 0.99),
  ("Fruit", "pineapple", 2.59)
).toDF("Type", "Item", "Price")

df.withColumn("hash_value", sha2(concat($"Type", $"Item", $"Price"), 256)).show(false)


+-----+---------+-----+----------------------------------------------------------------+
|Type |Item     |Price|hash_value                                                      |
+-----+---------+-----+----------------------------------------------------------------+
|Veg  |tomato   |1.99 |a9770f4457e613bcd20ebb47d50e374f9b7de6d0ff0d19e4f246b09a9505af67|
|Veg  |potato   |0.45 |367e976594aa71d73d69cc5ca2e4ab523ec64be88210f5b61da76f92142770c3|
|Fruit|apple    |0.99 |da52a8e67132cf43a8fe39e3e97ed5759e9aa3f16140320a0f82664e10eea136|
|Fruit|pineapple|2.59 |df0353ab978049e27e8ff809a804b452b0f3b06321b494b903aea432fa4203c2|
+-----+---------+-----+----------------------------------------------------------------+
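For completeness, the normalized variant sketched above would look like this (same df; trim and upper are applied to the concatenated string, so leading/trailing spaces and letter case no longer change the hash):

df.withColumn(
  "hash_value",
  // uppercase and trim the concatenated string so e.g. "veg..." and "Veg..." hash the same
  sha2(upper(trim(concat($"Type", $"Item", $"Price"))), 256)
).show(false)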

Upvotes: 0

Steven

Reputation: 15283

I advise using SHA-256 or SHA-512 to avoid collisions:

import pyspark.sql.functions as F

df.withColumn(
    "hash_value",
    F.sha2(
        # cast every column to string before concatenating and hashing
        F.concat(*(F.col(c).cast("string") for c in df.columns)),
        256
    )
).show()
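In the Scala of the original question, the same approach would look roughly like this (a sketch, assuming the same df; pass 512 instead of 256 for the longer digest):

import org.apache.spark.sql.functions._

// cast all columns to string, concatenate, and hash with SHA-256
df.withColumn(
  "hash_value",
  sha2(concat(df.columns.map(c => col(c).cast("string")): _*), 256)
).show(false)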

Upvotes: 2
