Younes

Reputation: 395

SparkSQL split using Regex

I'm trying to split a line into an array using a regular expression. The line contains an Apache log entry, and I'm looking to do the split using SQL.

I tried the split and array functions, but nothing worked.

Select split('10.10.10.10 - - [08/Sep/2015:00:00:03 +0000] "GET /index.html HTTP/1.1" 206 - - "Apache-HttpClient" -', '^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^"]+)" \d+ - - "([^"]+)".*')
;

I'm expecting an array with 6 elements

Thanks

Upvotes: 2

Views: 8435

Answers (1)

zero323

Reputation: 330073

The SPLIT function, as you can guess, splits a string on a pattern. Since the pattern string you provide matches the whole input, there is nothing left to return, hence an empty array.
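
To see why, compare splitting on just the delimiter with splitting on a pattern that matches the whole line (a quick illustration, not from the original answer; it assumes a sqlContext as in spark-shell):

// The delimiter pattern matches between tokens, so the tokens are returned,
// e.g. ["a", "b", "c"]:
sqlContext.sql("SELECT split('a b c', ' ')").show()

// A pattern that consumes the entire input leaves no tokens behind,
// hence the empty array described above:
sqlContext.sql("SELECT split('a b c', 'a b c')").show()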

import org.apache.spark.sql.functions.{regexp_extract, array, udf}
import scala.util.matching.Regex

val pattern = """^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^"]+)" \d+ - - "([^"]+)".*"""
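// The six capture groups, matched against the sample line: 1 = client IP,
// 2 and 3 = the two "-" fields, 4 = timestamp, 5 = request line, 6 = user agent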

val df = sc.parallelize(Seq((
  1L, """10.10.10.10 - - [08/Sep/2015:00:00:03 +0000] "GET /index.html HTTP/1.1" 206 - - "Apache-HttpClient" -"""
))).toDF("id", "log")

What you need here is regexp_extract:

val exprs = (1 to 6).map(i => regexp_extract($"log", pattern, i).alias(s"_$i"))

df.select(exprs:_*).show
// +-----------+---+---+--------------------+--------------------+-----------------+
// |         _1| _2| _3|                  _4|                  _5|               _6|
// +-----------+---+---+--------------------+--------------------+-----------------+
// |10.10.10.10|  -|  -|08/Sep/2015:00:00...|GET /index.html H...|Apache-HttpClient|
// +-----------+---+---+--------------------+--------------------+-----------------+
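
If you want the result as a single array column with six elements, as the question expects, the extracted columns can be wrapped back with array (a small sketch building on the exprs above; the column name fields is arbitrary):

df.select(array(exprs: _*).alias("fields")).show(false)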

or, for example, a UDF:

val extractFromLog = udf({
  // Compile the pattern once, when the UDF is created
  val logLine = new Regex(pattern)
  (s: String) => s match {
    // Let's ignore some fields for simplicity
    case logLine(ip, _, _, ts, request, client) =>
      Some(Array(ip, ts, request, client))
    case _ => None
  }
})

df.select(extractFromLog($"log"))
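
Lines that don't match the pattern come back as null (the None branch above), so they are easy to filter out (a usage sketch; the column name parsed is made up here):

val parsed = df.select(extractFromLog($"log").alias("parsed"))
parsed.filter($"parsed".isNotNull).show(false)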

Upvotes: 3
