Spark UDF for sequence Generator

Question

All,

I am trying to create a UDF for a Spark DataFrame that generates a unique ID per row. To ensure uniqueness, the ID generator concatenates three parts: the unique source ID passed as an argument, the epoch value of the current timestamp (bigint), and a 5-digit random number.

I have 2 questions:

1. How can I include monotonically_increasing_id() in the ID generation function "idGenerator"?
2. When I use the UDF, it fails with the error below:

    error: type mismatch;
     found   : String("SRC_")
     required: org.apache.spark.sql.Column
            df.withColumn("rowkey", SequenceGeneratorUtil.GenID("SRC_"))

Any pointers would be appreciated.

    import org.apache.spark.sql.functions.udf

    object SequenceGeneratorUtil extends Serializable {

        val random = new scala.util.Random
        val start = 10000
        val end = 99999

        // CustomEpochGenerator is a custom function that returns the epoch value of the current timestamp in milliseconds.
        // The ID is built as: source ID passed as argument + epoch value of timestamp (bigint) + 5-digit random number.
        def idGenerator(SrcIdentifier: String): String =
            SrcIdentifier + CustomEpochGenerator.nextID.toString +
                (start + random.nextInt((end - start) + 1)).toString // + monotonically_increasing_id (not working)

        // Wrap idGenerator as a Spark UDF that takes one String argument and returns a String.
        val GenID = udf[String, String](idGenerator _)

    }

    // This call fails with the type mismatch shown above: a UDF expects Column arguments, not a String.
    val df2 = df.withColumn("rowkey", SequenceGeneratorUtil.GenID("SRC_"))


Answers (1)
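
Both points can probably be handled at the call site. A UDF created with udf(...) accepts Column arguments, so the string literal has to be wrapped with lit("SRC_"). monotonically_increasing_id() is itself a Column expression, so it cannot be called as a plain value inside the function body, but it can be passed to the UDF as a second argument. The following is only a sketch under some assumptions: System.currentTimeMillis() stands in for CustomEpochGenerator.nextID, the Demo object and its sample DataFrame are hypothetical, and asNondeterministic() needs Spark 2.3 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, monotonically_increasing_id, udf}

object SequenceGeneratorUtil extends Serializable {

  private val start = 10000
  private val end = 99999

  // Assumption: System.currentTimeMillis() replaces CustomEpochGenerator.nextID,
  // which is not shown in the question.
  def idGenerator(srcIdentifier: String, rowId: Long): String = {
    val suffix = start + scala.util.Random.nextInt((end - start) + 1) // 5-digit random number
    s"$srcIdentifier${System.currentTimeMillis()}$suffix$rowId"
  }

  // Type parameters are [return type, first argument, second argument].
  // The function uses the clock and a random number, so it is marked
  // nondeterministic to keep the optimizer from re-evaluating or reordering it.
  val GenID = udf[String, String, Long](idGenerator _).asNondeterministic()
}

// Hypothetical driver, only to show the call site.
object Demo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("id-gen").getOrCreate()
    import spark.implicits._

    val df = Seq("a", "b", "c").toDF("value")

    // Both arguments are Columns: lit(...) wraps the constant string,
    // monotonically_increasing_id() supplies a per-row Long.
    val df2 = df.withColumn(
      "rowkey",
      SequenceGeneratorUtil.GenID(lit("SRC_"), monotonically_increasing_id())
    )

    df2.show(false)
    spark.stop()
  }
}
```

Since monotonically_increasing_id() is documented as unique (though not consecutive) within a DataFrame, the random suffix may be redundant once the row ID is part of the key. If no custom logic is needed, the built-in concat function with lit and a cast of the row ID to string might avoid the UDF altogether.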
