toofrellik

Reputation: 1307

spark scala - UDF usage for creating new column

I need to create a new column called hash_id from the uid column of my dataframe. Below is my code:

//1. Define a hashing function
import java.math.BigInteger
import javax.xml.bind.DatatypeConverter

def calculate_hashid(uid: String): BigInteger = {
  val md = java.security.MessageDigest.getInstance("SHA-1")
  val ha = new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
  return ha
}

//2. Convert the function to a UDF
val calculate_hashidUDF = udf(calculate_hashid)

//3. Apply the UDF on the spark dataframe
val userAgg_Data_hashid = userAgg_Data.withColumn("hash_id", calculate_hashidUDF($"uid"))

I am getting an error at udf(calculate_hashid) saying

missing arguments for the method calculate_hashid(string)

I have gone through many examples online but could not resolve it. What am I missing here?

Upvotes: 1

Views: 1355

Answers (1)

koiralo

Reputation: 23109

You can register your udf with explicit type parameters (the return type comes first, then the argument type):

val calculate_hashidUDF = udf[BigInteger, String](calculate_hashid _)

You can also rewrite your udf as

def calculate_hashidUDF = udf(((uid: String) => {
  val md = java.security.MessageDigest.getInstance("SHA-1")
  new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
}): String => BigInteger)

Or even without the return type annotation:

def calculate_hashidUDF = udf((uid: String) => {
  val md = java.security.MessageDigest.getInstance("SHA-1")
  new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16).mod(BigInteger.valueOf(10000))
})
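Putting the pieces together, here is a minimal end-to-end sketch. It assumes an active SparkSession named spark and reuses the question's column and variable names; the sample uid values are made up. The hash is returned as a String rather than a BigInteger, since support for encoding java.math.BigInteger as a Spark SQL type varies across Spark versions:

```scala
import java.math.BigInteger
import javax.xml.bind.DatatypeConverter
import org.apache.spark.sql.functions.udf
import spark.implicits._

// SHA-1 the uid, reduce mod 10000, and return the digits as a String
// (avoids depending on BigInteger being a Spark-encodable type)
val calculate_hashidUDF = udf((uid: String) => {
  val md = java.security.MessageDigest.getInstance("SHA-1")
  new BigInteger(DatatypeConverter.printHexBinary(md.digest(uid.getBytes)), 16)
    .mod(BigInteger.valueOf(10000))
    .toString
})

// Hypothetical sample data standing in for the question's userAgg_Data
val userAgg_Data = Seq("user1", "user2").toDF("uid")
val userAgg_Data_hashid = userAgg_Data.withColumn("hash_id", calculate_hashidUDF($"uid"))
userAgg_Data_hashid.show()
```

Note that javax.xml.bind.DatatypeConverter was removed from the JDK in Java 11+, so on newer runtimes you would need the jaxb-api dependency or a different hex encoder.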

Upvotes: 2
