Reputation: 13
Let's say I have a text file of lines in a fixed format. I have a regular expression that can capture each line into groups, and those captured values are what I'd like put into my columns. I'd like each group to become its own column in an RDD.
val pattern = """(\S+) "([\S\s]+)" (\S+) (\S+) (\S+) (\S+)""".r
val myrdd = sc.textFile("my/data/set.txt")
myrdd.map(line => pattern.findAllIn(line))
I've tried several different methods for getting the matches from the regex out into different columns, like toArray, toSeq, but haven't even come close yet.
I'm aware of how the data exists inside the matches....
val answer = pattern.findAllIn(line).matchData
for (m <- answer) {
  for (e <- m.subgroups) {
    println(e)
  }
}
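To make that concrete, here is a minimal self-contained sketch of the loop above on a single hypothetical line (the real data isn't shown in the post, so the sample line below is invented to fit the pattern):

```scala
// Hypothetical input line shaped to match the pattern; the actual data may differ
val pattern = """(\S+) "([\S\s]+)" (\S+) (\S+) (\S+) (\S+)""".r
val line = """127.0.0.1 "GET /index.html" 200 1024 foo bar"""

// Walk every match and print each captured group on its own line
for (m <- pattern.findAllIn(line).matchData; e <- m.subgroups)
  println(e)
```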
It's those e values that I'm after, but I'm not having much luck getting them separated out into columns in my RDD.
Thanks
Upvotes: 1
Views: 667
Reputation: 22439
I would suggest using a for-comprehension, rather than a for-loop, to generate a list of extracted groups per line, then mapping the list elements into individual columns:
val rdd = sc.textFile("/path/to/textfile")
val pattern = """(\S+) "([\S\s]+)" (\S+) (\S+) (\S+) (\S+)""".r

rdd.map { line =>
  (for {
    m <- pattern.findAllIn(line).matchData
    g <- m.subgroups
  } yield g).toList
}.map(l => (l(0), l(1), l(2), l(3), l(4), l(5)))
// org.apache.spark.rdd.RDD[(String, String, String, String, String, String)] = ...
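The same transformation can be checked locally on a plain Scala List before running it on Spark — a minimal sketch with invented sample lines (the real data isn't shown; in Spark you'd swap the List for sc.textFile(...)):

```scala
// Hypothetical sample lines shaped to match the pattern
val pattern = """(\S+) "([\S\s]+)" (\S+) (\S+) (\S+) (\S+)""".r
val lines = List("""a "b c" d e f g""", """h "i j" k l m n""")

// Same logic as the RDD version: collect all six groups per line,
// then spread the list into a six-element tuple (one column each)
val rows = lines.map { line =>
  (for {
    m <- pattern.findAllIn(line).matchData
    g <- m.subgroups
  } yield g).toList
}.map(l => (l(0), l(1), l(2), l(3), l(4), l(5)))
// rows: List[(String, String, String, String, String, String)]
```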
Upvotes: 2