user11751463


UDF to extract a String in Scala

I'm trying to extract the last number from values of this form:

urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)

In this example I'm trying to extract 10342800535 as a string.

This is my code in Scala:

def extractNestedUrn(urn: String): String = {
    val arr = urn.split(":").map(_.trim)
    val nested = arr(3)
    val clean = nested.substring(1, nested.length -1)
    val subarr = clean.split(":").map(_.trim)
    val res = subarr(3)
    val out = res.split(",").map(_.trim)
    val fin = out(1)
    fin.toString
  }

This is run as a UDF, and it throws the following error:

org.apache.spark.SparkException: Failed to execute user defined function

What am I doing wrong?

Upvotes: 0

Views: 612

Answers (2)

Vincent Doba

Reputation: 5068

One reason the org.apache.spark.SparkException: Failed to execute user defined function exception is raised is that an exception was thrown inside your user defined function.

Analysis

If I try to run your user defined function with the example input you provided, using the code below:

import org.apache.spark.sql.functions.{col, udf}
import sparkSession.implicits._

val dataframe = Seq("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)").toDF("urn")

def extractNestedUrn(urn: String): String = {
  val arr = urn.split(":").map(_.trim)
  val nested = arr(3)
  val clean = nested.substring(1, nested.length -1)
  val subarr = clean.split(":").map(_.trim)
  val res = subarr(3)
  val out = res.split(",").map(_.trim)
  val fin = out(1)
  fin.toString
}

val extract_urn = udf(extractNestedUrn _)

dataframe.select(extract_urn(col("urn"))).show(false)

I get this complete stack trace:

Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function(UdfExtractionError$$$Lambda$1165/1699756582: (string) => string)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130)
  at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
  ...
  at UdfExtractionError$.main(UdfExtractionError.scala:37)
  at UdfExtractionError.main(UdfExtractionError.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
  at UdfExtractionError$.extractNestedUrn$1(UdfExtractionError.scala:29)
  at UdfExtractionError$.$anonfun$main$4(UdfExtractionError.scala:35)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:157)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1127)
  ... 86 more

The important part of this stack trace is actually:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 3

This is the exception raised when executing your user defined function code. If we analyse your function, you split the input by : twice. The result of the first split is actually this array:

["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]

and not this array:

["urn", "fb", "candidateHiringState", "(urn:fb:contract:187236028,10342800535)"]

So, if we execute the remaining statements of your function, you get:

val arr = ["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]
val nested = "(urn"
val clean = "ur"
val subarr = ["ur"]

As the next line accesses the fourth element of subarr, an array that contains only one element, an ArrayIndexOutOfBoundsException is raised, and Spark wraps it in a SparkException.
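The failure can be reproduced in plain Scala, without Spark (a minimal sketch using the example input from the question):

```scala
object SplitDemo extends App {
  val urn = "urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)"

  // split(":") also splits on the colons inside the parentheses
  val arr = urn.split(":").map(_.trim)
  println(arr.length) // 7 elements, not 4

  val nested = arr(3)                                 // "(urn"
  val clean  = nested.substring(1, nested.length - 1) // "ur"
  val subarr = clean.split(":").map(_.trim)           // Array("ur")

  // subarr(3) would throw java.lang.ArrayIndexOutOfBoundsException: 3
}
```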

Solution

Although the best solution to your problem is the other answer using regexp_extract, you can correct your user defined function as below:

def extractNestedUrn(urn: String): String = {
  val arr = urn.split(':') // split using character instead of string regexp
  val nested = arr.last // get last element of array, here "187236028,10342800535)"
  val subarr = nested.split(',')
  val res = subarr.last // get last element, here "10342800535)"
  val out = res.init // take all the string except the last character, to remove ')'
  out // no need to use .toString as out is already a String
}
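Calling the corrected function directly on the example input (plain Scala, outside Spark) returns the expected value:

```scala
object FixedUdfDemo extends App {
  def extractNestedUrn(urn: String): String = {
    val arr = urn.split(':')       // last element: "187236028,10342800535)"
    val nested = arr.last
    val subarr = nested.split(',')
    val res = subarr.last          // "10342800535)"
    res.init                       // drop the trailing ')'
  }

  val urn = "urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)"
  println(extractNestedUrn(urn)) // 10342800535
}
```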

However, as said before, the best solution is to use the built-in Spark function regexp_extract, as explained in the other answer. Your code will be easier to understand and more performant.

Upvotes: 0

stack0114106

Reputation: 8711

You can simply use the regexp_extract function. Check this:

import org.apache.spark.sql.functions.{col, regexp_extract}
import spark.implicits._

val df = Seq("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)").toDF("x")

df.show(false)
+-------------------------------------------------------------------+
|x                                                                  |
+-------------------------------------------------------------------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|
+-------------------------------------------------------------------+

df.withColumn("NestedUrn", regexp_extract(col("x"), """.*,(\d+)""", 1)).show(false)
+-------------------------------------------------------------------+-----------+
|x                                                                  |NestedUrn  |
+-------------------------------------------------------------------+-----------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|10342800535|
+-------------------------------------------------------------------+-----------+
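The regex itself can be checked in plain Scala before wiring it into Spark (a quick sketch; .unanchored is needed in a pattern match because the trailing ')' follows the captured digits):

```scala
object RegexDemo extends App {
  // Same pattern as passed to regexp_extract, group 1 captures the digits after the comma
  val pattern = """.*,(\d+)""".r.unanchored
  val urn = "urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)"

  urn match {
    case pattern(id) => println(id) // 10342800535
    case _           => println("no match")
  }
}
```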

Upvotes: 1
