Reputation:
I'm trying to extract the last number from a string in this format:
urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)
In this example, I want to extract 10342800535 as a string.
This is my code in Scala:
def extractNestedUrn(urn: String): String = {
  val arr = urn.split(":").map(_.trim)
  val nested = arr(3)
  val clean = nested.substring(1, nested.length - 1)
  val subarr = clean.split(":").map(_.trim)
  val res = subarr(3)
  val out = res.split(",").map(_.trim)
  val fin = out(1)
  fin.toString
}
This runs as a UDF and throws the following error:
org.apache.spark.SparkException: Failed to execute user defined function
What am I doing wrong?
Upvotes: 0
Views: 612
Reputation: 5068
One reason an org.apache.spark.SparkException: Failed to execute user defined function
exception is raised is that an exception was thrown inside your user defined function.
If I try to run your user defined function with the example input you provided, using the code below:
import org.apache.spark.sql.functions.{col, udf}
import sparkSession.implicits._
val dataframe = Seq("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)").toDF("urn")
def extractNestedUrn(urn: String): String = {
  val arr = urn.split(":").map(_.trim)
  val nested = arr(3)
  val clean = nested.substring(1, nested.length - 1)
  val subarr = clean.split(":").map(_.trim)
  val res = subarr(3)
  val out = res.split(",").map(_.trim)
  val fin = out(1)
  fin.toString
}
val extract_urn = udf(extractNestedUrn _)
dataframe.select(extract_urn(col("urn"))).show(false)
I get this complete stack trace:
Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function(UdfExtractionError$$$Lambda$1165/1699756582: (string) => string)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
...
at UdfExtractionError$.main(UdfExtractionError.scala:37)
at UdfExtractionError.main(UdfExtractionError.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
at UdfExtractionError$.extractNestedUrn$1(UdfExtractionError.scala:29)
at UdfExtractionError$.$anonfun$main$4(UdfExtractionError.scala:35)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:157)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1127)
... 86 more
The important part of this stack trace is actually:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
This is the exception raised while executing your user defined function. If we analyse your function, you split the input by : twice.
Because the nested URN also contains colons, the result of the first split is actually this array:
["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]
and not this array:
["urn", "fb", "candidateHiringState", "(urn:fb:contract:187236028,10342800535)"]
So, if we execute the remaining statements of your function, you get:
val arr = ["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]
val nested = "(urn"
val clean = "ur" // substring(1, nested.length - 1) drops the first and the last character of "(urn"
val subarr = ["ur"]
On the next line you access the fourth element (index 3) of the array subarr,
which contains only one element, so an ArrayIndexOutOfBoundsException
is raised, and Spark wraps it in a SparkException.
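The trace above can be reproduced without Spark. The following is a minimal sketch in plain Scala; the object name SplitTrace is only an illustrative name and is not part of the original code. It runs the first four statements of the function on the example input and prints the intermediate values:

```scala
object SplitTrace {
  // Example input taken from the question.
  val urn = "urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)"

  // The first split also cuts through the nested URN, giving 7 parts instead of 4.
  val arr: Array[String] = urn.split(":").map(_.trim)

  // Index 3 is only the start of the nested URN: "(urn".
  val nested: String = arr(3)

  // Dropping the first and last character of "(urn" leaves "ur".
  val clean: String = nested.substring(1, nested.length - 1)

  // No colon is left, so this array has a single element and subarr(3) would throw.
  val subarr: Array[String] = clean.split(":").map(_.trim)

  def main(args: Array[String]): Unit = {
    println(arr.mkString("[", ", ", "]"))
    println(s"nested = $nested, clean = $clean, subarr.length = ${subarr.length}")
  }
}
```

Running the function body outside Spark like this is often the quickest way to find the real exception hidden behind the SparkException.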
Although the best solution to your problem is obviously the other answer with regexp_extract, you can correct your user defined function as below:
def extractNestedUrn(urn: String): String = {
  val arr = urn.split(':') // split using character instead of string regexp
  val nested = arr.last // get last element of array, here "187236028,10342800535)"
  val subarr = nested.split(',')
  val res = subarr.last // get last element, here "10342800535)"
  val out = res.init // take all the string except the last character, to remove ')'
  out // no need to use .toString as out is already a String
}
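Note that the corrected function still assumes every input is non-null and ends with ",digits)". If your data may contain nulls or malformed URNs, a more defensive variant could look like the sketch below. The names UrnParsing and extractLastId are hypothetical, and the sketch is plain Scala so it can be tried without Spark:

```scala
object UrnParsing {
  // A comma, then one or more digits (captured), then the closing parenthesis at the end of the input.
  private val LastId = """,(\d+)\)$""".r

  // Returns None for null or non-matching input instead of throwing.
  def extractLastId(urn: String): Option[String] =
    Option(urn).flatMap(u => LastId.findFirstMatchIn(u).map(_.group(1)))
}
```

For the example input, UrnParsing.extractLastId returns Some("10342800535"); for null or for a string without the trailing ",digits)" part it returns None, so nothing is thrown inside the function.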
However, as said before, the best solution is to use Spark's built-in function regexp_extract,
as explained in the other answer. Your code will be easier to understand and will perform better.
Upvotes: 0
Reputation: 8711
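You can simply use the regexp_extract function. For example:
import org.apache.spark.sql.functions.{col, regexp_extract} // already in scope in spark-shell
val df = Seq(("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)")).toDF("x")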
df.show(false)
+-------------------------------------------------------------------+
|x                                                                  |
+-------------------------------------------------------------------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|
+-------------------------------------------------------------------+
df.withColumn("NestedUrn", regexp_extract(col("x"), """.*,(\d+)""", 1)).show(false)
+-------------------------------------------------------------------+-----------+
|x                                                                  |NestedUrn  |
+-------------------------------------------------------------------+-----------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|10342800535|
+-------------------------------------------------------------------+-----------+
Upvotes: 1