Wonko the Sane

Reputation: 832

Extract domain from URLs using Scala

I am trying to extract domains from URLs.

Input:

    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    val b = Seq(
        ("subdomain.example.com/test.php"),
        ("example.com"),
        ("example.buzz"),
        ("test.example.buzz"),
        ("subdomain.example.co.uk"),
    ).toDF("raw_url")
    b.withColumn("host", callUDF("parse_url", $"raw_url", lit("HOST"))).show()

Expected results:

    +--------------------------------+---------------+
    | raw_url                        | host          |
    +--------------------------------+---------------+
    | subdomain.example.com/test.php | example.com   |
    | example.com                    | example.com   | 
    | example.buzz                   | example.buzz  |
    | test.example.buzz              | example.buzz  |
    | subdomain.example.co.uk        | example.co.uk |
    +--------------------------------+---------------+

Any advice much appreciated.

EDIT: based on the tip from @AlexOtt I have got a few steps closer.

    import com.google.common.net.InternetDomainName
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    val b = Seq(
        ("subdomain.example.com/test.php"),
        ("example.com"),
        ("example.buzz"),
        ("test.example.buzz"),
        ("subdomain.example.co.uk"),
    ).toDF("raw_url")
    var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()

However, I clearly have not implemented it correctly with withColumn. Here is the error:

    error: not found: value topPrivateDomain
    var c = b.withColumn("host", callUDF("InternetDomainName.from", $"raw_url", topPrivateDomain)).show()

EDIT 2:

Got some good pointers from @sarveshseri and after cleaning up some syntax errors, the following code is able to remove the subdomains from most of the URLs.

    import org.apache.spark.sql.functions.udf
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    import com.google.common.net.InternetDomainName
    import java.net.URL

    val b = Seq(
        ("subdomain.example.com/test.php"),
        ("example.com"),
        //("example.buzz"),
        //("test.example.buzz"),
        ("subdomain.example.co.uk")
    ).toDF("raw_url")

    val hostExtractUdf = udf { (urlString: String) =>
        val url = new URL("https://" + urlString)
        val host = url.getHost
        InternetDomainName.from(host).topPrivateDomain().name()
    }

    b.select("raw_url")
        .withColumn("HOST", hostExtractUdf(col("raw_url")))
        .show(false)

However, it still does not work as expected. Newer suffixes like .buzz, .site, and .today cause the following error:

    Caused by: java.lang.IllegalStateException: Not under a public suffix: example.buzz
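For anyone hitting the same exception: one workaround I considered (a sketch only, not thoroughly tested) is to guard the topPrivateDomain() call with isUnderPublicSuffix, falling back to the raw host when the suffix is not on the public-suffix list bundled with the installed Guava version. This avoids the crash, but it will still not strip subdomains for suffixes Guava does not recognize; the name safeHostUdf is mine, not from any library.

```scala
import java.net.URL
import com.google.common.net.InternetDomainName
import org.apache.spark.sql.functions.udf

// Sketch: avoid IllegalStateException for suffixes missing from the
// public-suffix list by falling back to the raw host instead of throwing.
val safeHostUdf = udf { (urlString: String) =>
  // Prepend a scheme so java.net.URL accepts bare hosts like "example.buzz".
  val host = new URL("https://" + urlString).getHost
  val idn = InternetDomainName.from(host)
  if (idn.isUnderPublicSuffix) idn.topPrivateDomain().toString
  else host // unrecognized suffix: return the host unchanged
}
```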

Upvotes: 0

Views: 1658

Answers (2)

blackbishop

Reputation: 32670

Maybe you can use a regex with Spark's built-in regexp_extract and regexp_replace functions. Here's an example:

    val c = b.withColumn(
      "HOST",
      regexp_extract(col("raw_url"), raw"^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)", 1)
    ).withColumn(
      "sub_domain",
      regexp_extract(col("HOST"), raw"(.*?)\.(?=[^\/]*\..{2,5})/?.*", 1)
    ).withColumn(
      "HOST",
      expr("trim(LEADING '.' FROM regexp_replace(HOST, sub_domain, ''))")
    ).drop("sub_domain")

    c.show(false)
    //+-----------------------------------+-------------+
    //|raw_url                            |HOST         |
    //+-----------------------------------+-------------+
    //|subdomain.example.com/test.php     |example.com  |
    //|example.com                        |example.com  |
    //|example.buzz                       |example.buzz |
    //|test.example.buzz                  |example.buzz |
    //|https://www.subdomain.example.co.uk|example.co.uk|
    //|subdomain.domain.buzz              |domain.buzz  |
    //|dev.example.today                  |example.today|
    //+-----------------------------------+-------------+

The first regexp_extract pulls the full host name from the URL (including the subdomain). Then, using the regex taken from this answer, we capture the subdomain and strip it from the host.

I didn't test it for all possible cases, but it works fine for the examples given in your question.
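The host-extraction pattern above can also be sanity-checked outside Spark with plain Scala regex matching (the helper name extractHost is mine; the pattern is the same one passed to regexp_extract, with the dot after www escaped):

```scala
// Same host-extraction pattern as in the Spark snippet above.
val hostPattern = raw"^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)".r

// Returns the host portion of a URL-like string, if the pattern matches.
def extractHost(url: String): Option[String] =
  hostPattern.findFirstMatchIn(url).map(_.group(1))

// extractHost("https://www.subdomain.example.co.uk") == Some("subdomain.example.co.uk")
// extractHost("subdomain.example.com/test.php")      == Some("subdomain.example.com")
```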

Upvotes: 2

sarveshseri

Reputation: 13985

First, you will need to add Guava to the dependencies in build.sbt:

    libraryDependencies += "com.google.guava" % "guava" % "31.0.1-jre"

Now you can extract the host as follows:

    import com.google.common.net.InternetDomainName
    import org.apache.spark.sql.functions._
    import java.net.URL

    import spark.implicits._

    val hostExtractUdf = udf { (urlString: String) =>
        // Prepend a scheme so java.net.URL accepts bare hosts like "b.com".
        val url = new URL("https://" + urlString)
        val host = url.getHost
        InternetDomainName.from(host).topPrivateDomain().toString
    }

    val b = sc.parallelize(Seq(
        ("a.b.com/c.php"),
        ("a.b.site/c.php"),
        ("a.b.buzz/c.php"),
        ("a.b.today/c.php"),
        ("b.com"),
        ("b.site"),
        ("b.buzz"),
        ("b.today"),
        ("a.b.buzz"),
        ("a.b.co.uk"),
        ("a.b.site")
    )).toDF("raw_url")

    val c = b.withColumn("HOST", hostExtractUdf(col("raw_url")))

    c.show()

Output of c.show:

    +---------------+-------+
    |        raw_url|   HOST|
    +---------------+-------+
    |  a.b.com/c.php|  b.com|
    | a.b.site/c.php| b.site|
    | a.b.buzz/c.php| b.buzz|
    |a.b.today/c.php|b.today|
    |          b.com|  b.com|
    |         b.site| b.site|
    |         b.buzz| b.buzz|
    |        b.today|b.today|
    |       a.b.buzz| b.buzz|
    |      a.b.co.uk|b.co.uk|
    |       a.b.site| b.site|
    +---------------+-------+

Upvotes: 3
