Reputation: 353
I am using the following function to parse a URL, but it throws an error:
val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
.withColumn("host",parse_url($"url_col","HOST"))
.withColumn("query",parse_url($"url_col","QUERY"))
.show(false)
Error:
<console>:285: error: not found: value parse_url
.withColumn("host",parse_url($"url_col","HOST"))
^
<console>:286: error: not found: value parse_url
.withColumn("query",parse_url($"url_col","QUERY"))
^
Kindly guide me on how to parse a URL into its different parts.
Upvotes: 2
Views: 9168
Reputation: 19308
I created a library called bebe that exposes the parse_url
functionality via the Scala API.
Suppose you have the following DataFrame:
+------------------------------------+---------------+
|some_string |part_to_extract|
+------------------------------------+---------------+
|http://spark.apache.org/path?query=1|HOST |
|http://spark.apache.org/path?query=1|QUERY |
|null |null |
+------------------------------------+---------------+
Calculate the different parts of the URL:
import mrpowers.bebe.BebeFunctions._ // import for bebe's functions (path as given in the bebe README)

df.withColumn("actual", bebe_parse_url(col("some_string"), col("part_to_extract"))).show(false)
+------------------------------------+---------------+----------------+
|some_string |part_to_extract|actual |
+------------------------------------+---------------+----------------+
|http://spark.apache.org/path?query=1|HOST |spark.apache.org|
|http://spark.apache.org/path?query=1|QUERY |query=1 |
|null |null |null |
+------------------------------------+---------------+----------------+
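If you want the same part for every row instead of reading it from a column, passing a literal should work too, since bebe_parse_url takes plain Column arguments (a small sketch under that assumption):
import org.apache.spark.sql.functions.{col, lit}
import mrpowers.bebe.BebeFunctions._

// lit("HOST") is just another Column, so the part to extract can be fixed
df.withColumn("host", bebe_parse_url(col("some_string"), lit("HOST")))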
Upvotes: 0
Reputation: 16076
The answer by @Ramesh is correct, but you might also want a hacky way to use this function without SQL queries :)
The hack lies in the fact that the callUDF function calls not only UDFs, but any available function.
So you can write:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
b.withColumn("host", callUDF("parse_url", $"url_col", lit("HOST"))).
withColumn("query", callUDF("parse_url", $"url_col", lit("QUERY"))).
show(false)
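As a side note, callUDF is deprecated in newer Spark versions in favor of call_udf (I believe since Spark 3.2), so on those versions the same trick would be:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

// call_udf is the non-deprecated replacement for callUDF; it also resolves built-in functions
b.withColumn("host", call_udf("parse_url", $"url_col", lit("HOST"))).
  withColumn("query", call_udf("parse_url", $"url_col", lit("QUERY"))).
  show(false)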
Edit: after this Pull Request is merged, you will be able to use parse_url like a normal function. The PR was made after this question :)
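Assuming the PR adds a Scala function that mirrors the SQL signature (Column arguments for the URL and the part to extract), usage would presumably look like this sketch once it lands in your Spark version:
import org.apache.spark.sql.functions.{lit, parse_url} // hypothetical until the PR is released

b.withColumn("host", parse_url($"url_col", lit("HOST"))).
  withColumn("query", parse_url($"url_col", lit("QUERY"))).
  show(false)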
Upvotes: 5
Reputation: 3316
As mentioned before, when you register a UDF you don't get a Java function; rather, you introduce it to Spark, so you must call it the "Spark way".
I want to suggest another method that I find convenient, especially when there are several columns you want to add: using selectExpr.
val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
val c = b.selectExpr("*", "parse_url(url_col, 'HOST') as host", "parse_url(url_col, 'QUERY') as query")
c.show(false)
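If you prefer withColumn over selectExpr, the expr function from org.apache.spark.sql.functions is the same escape hatch for a single column, since it parses a SQL expression string into a Column:
import org.apache.spark.sql.functions.expr

// expr lets you call SQL-only functions such as parse_url column by column
val d = b
  .withColumn("host", expr("parse_url(url_col, 'HOST')"))
  .withColumn("query", expr("parse_url(url_col, 'QUERY')"))
d.show(false)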
Upvotes: 4
Reputation: 41957
parse_url
is available only in SQL and not as an API. Refer to parse_url.
So you should be using it through a SQL query and not as an API function call.
You should register the DataFrame as a temporary view and query it as below:
val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
b.createOrReplaceTempView("temp")
spark.sql("SELECT url_col, parse_url(`url_col`, 'HOST') as HOST, parse_url(`url_col`,'QUERY') as QUERY from temp").show(false)
which should give you the following output:
+--------------------------------------------------------------------------------------------+-----------------+-------+
|url_col |HOST |QUERY |
+--------------------------------------------------------------------------------------------+-----------------+-------+
|http://spark.apache.org/path?query=1 |spark.apache.org |query=1|
|https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative|people.apache.org|null |
+--------------------------------------------------------------------------------------------+-----------------+-------+
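parse_url can extract other parts as well (the documented part names include PROTOCOL, PATH, REF, FILE, AUTHORITY and USERINFO), and for QUERY it accepts an optional third argument to pull out a single query parameter:
spark.sql("SELECT parse_url(url_col, 'PROTOCOL') as protocol, parse_url(url_col, 'PATH') as path, parse_url(url_col, 'QUERY', 'query') as query_value FROM temp").show(false)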
I hope the answer is helpful
Upvotes: 5