Reputation: 13
Let's say I have a text file of lines in a fixed format. I have a regular expression that can capture each line into groups, and those captured values are what I'd like put into my columns. I'd like each group to become its own column in an RDD.
val pattern = """(\S+) "([\S\s]+)" (\S+) (\S+) (\S+) (\S+)""".r
val myrdd = sc.textFile("my/data/set.txt")
myrdd.map(line => pattern.findAllIn(line))
I've tried several different methods for getting the matches from the regex out into different columns, like toArray, toSeq, but haven't even come close yet.
I'm aware of how the data exists inside the matches....
val answer = pattern.findAllIn(line).matchData
for (m <- answer) {
  for (e <- m.subgroups) {
    println(e)
  }
}
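To make that concrete, here is a minimal self-contained sketch of the loop above on a single hypothetical line (the real data isn't shown in the post, so the sample line below is invented to fit the pattern):

```scala
// Hypothetical input line shaped to match the pattern; the actual data may differ
val pattern = """(\S+) "([\S\s]+)" (\S+) (\S+) (\S+) (\S+)""".r
val line = """127.0.0.1 "GET /index.html" 200 1024 foo bar"""

// Walk every match and print each captured group on its own line
for (m <- pattern.findAllIn(line).matchData; e <- m.subgroups)
  println(e)
```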
It's those e values that I'm after, but I'm not having much luck getting them separated out into columns in my RDD.
Thanks
Upvotes: 1
Views: 667
Reputation: 22439
I would suggest using a for-comprehension, rather than a for-loop, to generate a list of extracted groups per line, then mapping the list elements into individual columns:
val rdd = sc.textFile("/path/to/textfile")
val pattern = """(\S+) "([\S\s]+)" (\S+) (\S+) (\S+) (\S+)""".r

rdd.map { line =>
  (for {
    m <- pattern.findAllIn(line).matchData
    g <- m.subgroups
  } yield g).toList
}.map(l => (l(0), l(1), l(2), l(3), l(4), l(5)))
// org.apache.spark.rdd.RDD[(String, String, String, String, String, String)] = ...
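The same transformation can be checked locally on a plain Scala List before running it on Spark — a minimal sketch with invented sample lines (the real data isn't shown; in Spark you'd swap the List for sc.textFile(...)):

```scala
// Hypothetical sample lines shaped to match the pattern
val pattern = """(\S+) "([\S\s]+)" (\S+) (\S+) (\S+) (\S+)""".r
val lines = List("""a "b c" d e f g""", """h "i j" k l m n""")

// Same logic as the RDD version: collect all six groups per line,
// then spread the list into a six-element tuple (one column each)
val rows = lines.map { line =>
  (for {
    m <- pattern.findAllIn(line).matchData
    g <- m.subgroups
  } yield g).toList
}.map(l => (l(0), l(1), l(2), l(3), l(4), l(5)))
// rows: List[(String, String, String, String, String, String)]
```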
Upvotes: 2