Wing

Reputation: 21

Using Spark to extract part of the string in CSV format

Spark newbie here and hopefully you guys can give me some help. Thanks!

I am trying to extract a URL from a CSV file; the URL is in the 16th column. The problem is that the URLs were written in a strange format, as you can see from the output of the code below. What is the best approach to get the URL in the correct format?

case class Log(time_stamp: String, url: String)

val logText = sc.textFile("hdfs://...")
  .map(_.split(","))
  .map(s => Log(s(0).replaceAll("\"", ""), s(15).replaceAll("\"", "")))
  .toDF()


logText.registerTempTable("log")

val results = sqlContext.sql("SELECT * FROM log")
results.map(s => "URL: " + s(1)).collect().foreach(println)

URL: /XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz
URL: /XX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz
URL: /XXX/YYY/ZZZ/http/www.domain.com/
URL: /VV/XXXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz

Upvotes: 0

Views: 3485

Answers (2)

marios

Reputation: 8996

The question comes down to parsing the long strings and extracting the domain name. This solution will work as long as none of the placeholder segments (XXX, YYY, etc.) is itself "http" or "https":

def getUrl(data: String): Option[String] = {
  val slidingPairs = data.split("/").sliding(2)
  slidingPairs.flatMap {
    case Array(x, y) if x == "http" || x == "https" => Some(y)
    case _ => None
  }.toList.headOption
}

Here are some examples in the REPL:

scala> getUrl("/XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz")
res0: Option[String] = Some(www.domain.com)

scala> getUrl("/XXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz")
res1: Option[String] = Some(sub.domain.com)

scala> getUrl("/XXX/YYY/ZZZ/a/asdsd/asdase/123123213/https/sub.domain.com/xyz/xyz/xyz/xyz")
res2: Option[String] = Some(sub.domain.com)

Upvotes: 0

zero323

Reputation: 330093

You can try regexp_replace:

import org.apache.spark.sql.functions.regexp_replace

val df = sc.parallelize(Seq(
  (1L, "/XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz"),
  (2L, "/XXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz")
)).toDF("id", "url")

df
  .select(regexp_replace($"url", "^(/\\w+){3}/(https?)/", "$2://").alias("url"))
  .show(2, false)

// +--------------------------------------+
// |url                                   |
// +--------------------------------------+
// |http://www.domain.com/xyz/xyz         |
// |https://sub.domain.com/xyz/xyz/xyz/xyz|
// +--------------------------------------+

In Spark 1.4 you can try the equivalent Hive UDF:

df.selectExpr("""regexp_replace(url, '^(/\w+){3}/(https?)/','$2://') AS url""")

If the number of sections before http(s) can vary, you can adjust the regexp by replacing {3} with * or a range such as {2,4}.
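For example, the variable-length variant can be checked with plain `replaceFirst` before wiring it into `regexp_replace` (a sketch on the sample strings from the question; the greedy `(/\w+)*` backtracks until `(https?)` lines up with the scheme segment):

```scala
// Variable-length prefix: (/\w+)* matches any number of leading
// sections, then /(https?)/ captures the scheme to rebuild the URL.
val pattern = "^(/\\w+)*/(https?)/"

val urls = Seq(
  "/XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz",
  "/VV/XXXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz"
)

val cleaned = urls.map(_.replaceFirst(pattern, "$2://"))
// http://www.domain.com/xyz/xyz
// https://sub.domain.com/xyz/xyz/xyz
```

The same pattern string works unchanged as the second argument of `regexp_replace`, since both use Java regex syntax.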

Upvotes: 6
