user8167344

Reputation: 353

How to parse a URL in Spark SQL (Scala)

I am using the following function to parse a URL, but it throws an error:

val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
        .withColumn("host",parse_url($"url_col","HOST"))
        .withColumn("query",parse_url($"url_col","QUERY"))
        .show(false)

Error:

<console>:285: error: not found: value parse_url
               .withColumn("host",parse_url($"url_col","HOST"))
                                  ^
<console>:286: error: not found: value parse_url
               .withColumn("query",parse_url($"url_col","QUERY"))
                                   ^

Please advise how to parse a URL into its different parts.

Upvotes: 2

Views: 9168

Answers (4)

Powers

Reputation: 19308

I created a library called bebe that exposes the parse_url functionality via the Scala API.

Suppose you have the following DataFrame:

+------------------------------------+---------------+
|some_string                         |part_to_extract|
+------------------------------------+---------------+
|http://spark.apache.org/path?query=1|HOST           |
|http://spark.apache.org/path?query=1|QUERY          |
|null                                |null           |
+------------------------------------+---------------+

Calculate the different parts of the URL:

df.withColumn("actual", bebe_parse_url(col("some_string"), col("part_to_extract")))
+------------------------------------+---------------+----------------+
|some_string                         |part_to_extract|actual          |
+------------------------------------+---------------+----------------+
|http://spark.apache.org/path?query=1|HOST           |spark.apache.org|
|http://spark.apache.org/path?query=1|QUERY          |query=1         |
|null                                |null           |null            |
+------------------------------------+---------------+----------------+

Upvotes: 0

T. Gawęda

Reputation: 16076

The answer by @Ramesh is correct, but you might also want a hacky way to use this function without SQL queries :)

The hack relies on the fact that the "callUDF" function can call not only UDFs but any function registered with Spark, including built-in SQL functions such as parse_url.

So you can write:

import org.apache.spark.sql._
import org.apache.spark.sql.functions._

b.withColumn("host", callUDF("parse_url", $"url_col", lit("HOST"))).
 withColumn("query", callUDF("parse_url", $"url_col", lit("QUERY"))).
 show(false)

Edit: once this Pull Request is merged, you will be able to use parse_url like a normal function. The PR was opened after this question was asked :)
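Another way to stay in the DataFrame API without registering a temp view is functions.expr, which parses a SQL expression string into a Column. The following is a minimal sketch, assuming a local SparkSession (in spark-shell, getOrCreate simply returns the existing session); the names urls and parsed are chosen here for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("parse-url-expr")
  .getOrCreate()
import spark.implicits._

val urls = Seq(
  "http://spark.apache.org/path?query=1",
  "https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative"
).toDF("url_col")

// expr turns a SQL expression string into a Column,
// so SQL-only functions such as parse_url can be used with withColumn.
val parsed = urls
  .withColumn("host", expr("parse_url(url_col, 'HOST')"))
  .withColumn("query", expr("parse_url(url_col, 'QUERY')"))

parsed.show(false)
```

Compared with callUDF, the literal arguments live inside the SQL string, so no lit(...) wrapping is needed.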

Upvotes: 5

antonpuz

Reputation: 3316

As mentioned before, when you register a UDF you don't get a Java function; you introduce it to Spark instead, so you must call it the "Spark way".

I want to suggest another method that I find convenient, especially when you want to add several columns: selectExpr.

val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")
val c = b.selectExpr("*", "parse_url(url_col, 'HOST') as host", "parse_url(url_col, 'QUERY') as query")
c.show(false)
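The same selectExpr pattern extends to other parts of the URL. The sketch below assumes a local SparkSession and uses illustrative column names (query_value, path, protocol); it relies on the three-argument SQL form parse_url(url, 'QUERY', key), which returns the value of a single query parameter, and on the PATH and PROTOCOL part names.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("parse-url-select-expr")
  .getOrCreate()
import spark.implicits._

val urls = Seq(
  "http://spark.apache.org/path?query=1",
  "https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative"
).toDF("url_col")

// Each string is a SQL expression; "*" keeps the original columns.
val withParts = urls.selectExpr(
  "*",
  "parse_url(url_col, 'QUERY', 'query') as query_value", // value of the `query` parameter only
  "parse_url(url_col, 'PATH') as path",
  "parse_url(url_col, 'PROTOCOL') as protocol"
)

withParts.show(false)
```

For the first URL, query_value should be 1 rather than query=1, since the third argument selects one key from the query string.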

Upvotes: 4

Ramesh Maharjan

Reputation: 41957

parse_url is available only as a SQL function and not through the DataFrame API. Refer to parse_url.

So you should use it in a SQL query rather than as a function call through the API.

Register the DataFrame as a temporary view and query it as below:

val b = Seq(("http://spark.apache.org/path?query=1"),("https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")).toDF("url_col")

b.createOrReplaceTempView("temp")
spark.sql("SELECT url_col, parse_url(`url_col`, 'HOST') as HOST, parse_url(`url_col`,'QUERY') as QUERY from temp").show(false)

This should give you the following output:

+--------------------------------------------------------------------------------------------+-----------------+-------+
|url_col                                                                                     |HOST             |QUERY  |
+--------------------------------------------------------------------------------------------+-----------------+-------+
|http://spark.apache.org/path?query=1                                                        |spark.apache.org |query=1|
|https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative|people.apache.org|null   |
+--------------------------------------------------------------------------------------------+-----------------+-------+
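The null in the QUERY column for the second URL is expected: that URL has a fragment (#negative) but no ?query part. As a rough analogy using the plain JVM class java.net.URI (this is not Spark code, and urlParts is a hypothetical helper written only for this illustration), the same split looks like this:

```scala
import java.net.URI

// Hypothetical helper for illustration; not part of Spark.
// Missing components come back as None instead of null.
def urlParts(url: String): Map[String, Option[String]] = {
  val uri = new URI(url)
  Map(
    "HOST"  -> Option(uri.getHost),
    "PATH"  -> Option(uri.getRawPath),
    "QUERY" -> Option(uri.getRawQuery),
    "REF"   -> Option(uri.getRawFragment)
  )
}

val first = urlParts("http://spark.apache.org/path?query=1")
val second = urlParts(
  "https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/sql/#negative")

println(first("QUERY"))  // Some(query=1)
println(second("QUERY")) // None: there is no "?" part in this URL
println(second("REF"))   // Some(negative)
```

To get the fragment in Spark SQL, ask parse_url for the 'REF' part instead of 'QUERY'.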

I hope the answer is helpful

Upvotes: 5
