Reputation: 832
I am trying to extract domains from URLs.
Input:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
("example.buzz"),
("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
var c = b.withColumn("host", callUDF("parse_url", $"raw_url", lit("HOST"))).show()
Expected results:
+--------------------------------+---------------+
| raw_url                        | host          |
+--------------------------------+---------------+
| subdomain.example.com/test.php | example.com   |
| example.com                    | example.com   |
| example.buzz                   | example.buzz  |
| test.example.buzz              | example.buzz  |
| subdomain.example.co.uk        | example.co.uk |
+--------------------------------+---------------+
Any advice much appreciated.
EDIT: Based on the tip from @AlexOtt, I have got a few steps closer.
import com.google.common.net.InternetDomainName
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
("example.buzz"),
("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()
However, I clearly have not implemented it correctly with withColumn. Here is the error:
error: not found: value topPrivateDomain
var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()
EDIT 2:
Got some good pointers from @sarveshseri and after cleaning up some syntax errors, the following code is able to remove the subdomains from most of the URLs.
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import com.google.common.net.InternetDomainName
import java.net.URL
val b = Seq(
("subdomain.example.com/test.php"),
("example.com"),
//("example.buzz"),
//("test.example.buzz"),
("subdomain.example.co.uk"),
).toDF("raw_url")
val hostExtractUdf = org.apache.spark.sql.functions.udf {
(urlString: String) =>
val url = new URL("https://" + urlString)
val host = url.getHost
InternetDomainName.from(host).topPrivateDomain().name()
}
var c = b.select("raw_url").withColumn("HOST",
hostExtractUdf(col("raw_url")))
.show(false)
However, it still does not work as expected. Newer suffixes like .buzz, .site, and .today cause the following error:
Caused by: java.lang.IllegalStateException: Not under a public suffix: example.buzz
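The same exception can be reproduced outside Spark, which points at the public suffix list bundled inside the Guava jar on the classpath rather than at Spark itself (a minimal check with plain Guava, assuming its bundled suffix list predates .buzz):

import com.google.common.net.InternetDomainName

val d = InternetDomainName.from("example.buzz")
d.isUnderPublicSuffix()  // false when .buzz is missing from the bundled suffix list
d.topPrivateDomain()     // then throws java.lang.IllegalStateException: Not under a public suffix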
Upvotes: 0
Views: 1658
Reputation: 32670
Maybe you can use regex with Spark's regexp_extract and regexp_replace built-in functions. Here's an example:
val c = b.withColumn(
  "HOST",
  // extract the full host: optional scheme, credentials and "www." prefix are dropped
  regexp_extract(col("raw_url"), raw"^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www.)?([^:\/\n?]+)", 1)
).withColumn(
  "sub_domain",
  // capture everything before the registrable domain
  regexp_extract(col("HOST"), raw"(.*?)\.(?=[^\/]*\..{2,5})/?.*", 1)
).withColumn(
  "HOST",
  // remove the subdomain and the leading dot it leaves behind
  expr("trim(LEADING '.' FROM regexp_replace(HOST, sub_domain, ''))")
).drop("sub_domain")
c.show(false)
//+-----------------------------------+-------------+
//|raw_url                            |HOST         |
//+-----------------------------------+-------------+
//|subdomain.example.com/test.php     |example.com  |
//|example.com                        |example.com  |
//|example.buzz                       |example.buzz |
//|test.example.buzz                  |example.buzz |
//|https://www.subdomain.example.co.uk|example.co.uk|
//|subdomain.domain.buzz              |domain.buzz  |
//|dev.example.today                  |example.today|
//+-----------------------------------+-------------+
The first regexp_extract pulls the full host name from the URL (including the subdomain). Then, using the regex taken from this answer, we find the subdomain and replace it with an empty string.
I didn't test it for all possible cases, but it works fine for the examples given in your question.
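If it helps, both patterns can be sanity-checked outside Spark with plain Scala Regex; findFirstMatchIn plus group(1) mirrors what regexp_extract returns (a small sketch, not exhaustive):

val hostRe = raw"^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www.)?([^:\/\n?]+)".r
val subRe  = raw"(.*?)\.(?=[^\/]*\..{2,5})/?.*".r

hostRe.findFirstMatchIn("https://www.subdomain.example.co.uk").map(_.group(1))
// => Some(subdomain.example.co.uk)
subRe.findFirstMatchIn("subdomain.example.co.uk").map(_.group(1))
// => Some(subdomain)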
Upvotes: 2
Reputation: 13985
First you will need to add guava to the dependencies in build.sbt.
libraryDependencies += "com.google.guava" % "guava" % "31.0.1-jre"
Now you can extract the host as follows (a recent Guava release bundles an up-to-date public suffix list, so newer TLDs such as .buzz are recognized):
import com.google.common.net.InternetDomainName
import org.apache.spark.sql.functions._
import java.net.URL
import spark.implicits._
val hostExtractUdf = org.apache.spark.sql.functions.udf { (urlString: String) =>
  // prepend a scheme so java.net.URL can parse bare hosts like "b.com"
  val url = new URL("https://" + urlString)
  val host = url.getHost
  // Guava strips the subdomains, leaving the registrable domain (e.g. "b.co.uk")
  InternetDomainName.from(host).topPrivateDomain().toString
}
val b = sc.parallelize(Seq(
("a.b.com/c.php"),
("a.b.site/c.php"),
("a.b.buzz/c.php"),
("a.b.today/c.php"),
("b.com"),
("b.site"),
("b.buzz"),
("b.today"),
("a.b.buzz"),
("a.b.co.uk"),
("a.b.site")
)).toDF("raw_url")
val c = b.withColumn("HOST", hostExtractUdf(col("raw_url")))
c.show()
Output:
+---------------+-------+
|        raw_url|   HOST|
+---------------+-------+
|  a.b.com/c.php|  b.com|
| a.b.site/c.php| b.site|
| a.b.buzz/c.php| b.buzz|
|a.b.today/c.php|b.today|
|          b.com|  b.com|
|         b.site| b.site|
|         b.buzz| b.buzz|
|        b.today|b.today|
|       a.b.buzz| b.buzz|
|      a.b.co.uk|b.co.uk|
|       a.b.site| b.site|
+---------------+-------+
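If some inputs may still fall outside the public suffix list that your Guava version bundles, one possible refinement is to make the UDF total and fall back to the raw host instead of throwing (a sketch only; safeHostExtractUdf is a hypothetical name, not part of the code above):

import scala.util.Try

val safeHostExtractUdf = udf { (urlString: String) =>
  Try {
    val host = new URL("https://" + urlString).getHost
    val idn = InternetDomainName.from(host)
    // keep the registrable domain when Guava knows the suffix, otherwise keep the host
    if (idn.isUnderPublicSuffix()) idn.topPrivateDomain().toString else host
  }.getOrElse(urlString)   // on any parse failure, return the input unchanged
}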
Upvotes: 3