Reputation: 21
Spark newbie here and hopefully you guys can give me some help. Thanks!
I am trying to extract a URL from a CSV file and the URL is located at the 16th column. The problem is that the URLs were written in a strange format as you can see from the print out from the code below. What is the best approach to get a the URL in correct format?
case class log(time_stamp: String, url: String )
val logText = sc.textFile("hdfs://...").map(s => s.split(",")).map( s => log(s(0).replaceAll("\"", ""),s(15).replaceAll("\"", ""))).toDF()
logText.registerTempTable("log")
val results = sqlContext.sql("SELECT * FROM log")
results.map(s => "URL: " + s(1)).collect().foreach(println)
URL: /XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz
URL: /XX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz
URL: /XXX/YYY/ZZZ/http/www.domain.com/
URL: /VV/XXXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz
Upvotes: 0
Views: 3485
Reputation: 8996
The question comes down to parsing the long strings and extracting the domain name. This solution will work as long as you don't have any of the random strings (XXX,YYYYY, etc.) be "http" and "https":
def getUrl(data: String): Option[String] = {
val slidingPairs = data.split("/").sliding(2)
slidingPairs.flatMap{
case Array(x,y) =>
if(x=="http" || x == "https") Some(y) else None
}.toList.headOption
}
Here are some examples in the REPL:
scala> getUrl("/XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz")
res8: Option[String] = Some(www.domain.com)
scala> getUrl("/XXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz")
resX: Option[String] = Some(sub.domain.com)
scala> getUrl("/XXX/YYY/ZZZ/a/asdsd/asdase/123123213/https/sub.domain.com/xyz/xyz/xyz/xyz")
resX: Option[String] = Some(sub.domain.com)
Upvotes: 0
Reputation: 330093
You can try regexp_replace
:
import org.apache.spark.sql.functions.regexp_replace
val df = sc.parallelize(Seq(
(1L, "/XXX/YYY/ZZZ/http/www.domain.com/xyz/xyz"),
(2L, "/XXX/YYY/ZZZ/https/sub.domain.com/xyz/xyz/xyz/xyz")
)).toDF("id", "url")
df
.select(regexp_replace($"url", "^(/\\w+){3}/(https?)/", "$2://").alias("url"))
.show(2, false)
// +--------------------------------------+
// |url |
// +--------------------------------------+
// |http://www.domain.com/xyz/xyz |
// |https://sub.domain.com/xyz/xyz/xyz/xyz|
// +--------------------------------------+
In Spark 1.4 you can try Hive UDF:
df.selectExpr("""regexp_replace(url, '^(/\w+){3}/(https?)/','$2://') AS url""")
If number of sections before http(s)
can vary you adjust regexp by replacing {3}
with *
or range.
Upvotes: 6