Reputation: 3184
I am trying to split file data into good and bad records based on the date field, so I expect 2 result files. In the test file, the first 4 lines should go to the good data and the last 2 lines to the bad data.
I am having 2 issues
and the bad data result looks like the following, picking up only single characters of the name:
(,C,h) (,J,u) (,T,h) (,J,o) (,N,e) (,B,i)
Test file
Christopher|Jan 11, 2017|5
Justin|11 Jan, 2017|5
Thomas|6/17/2017|5
John|11-08-2017|5
Neli|2016|5
Bilu||5
Load and RDD
scala> val file = sc.textFile("test/data.txt")
scala> val fileRDD = file.map(x => x.split("|"))
RegEx
scala> val singleReg = """(\w(3))\s(\d+)(,)\s(\d(4))|(\d+)\s(\w(3))(,)\s(\d(4))|(\d+)(\/)(\d+)(\/)(\d(4))|(\d+)(-)(\d+)(-)(\d(4))""".r
Are the three double quotes (""") at the beginning and end, and the .r, important here?
Filter issue area
scala> val validSingleRecords = fileRDD.filter(x => (singleReg.pattern.matcher(x(1)).matches))
scala> val badSingleRecords = fileRDD.filter(x => !(singleReg.pattern.matcher(x(1)).matches))
Turn array into tuple
scala> val validSingle = validSingleRecords.map(x => (x(0),x(1),x(2)))
scala> val badSingle = badSingleRecords.map(x => (x(0),x(1),x(2)))
Write file
scala> validSingle.repartition(1).saveAsTextFile("data/singValid")
scala> badSingle.repartition(1).saveAsTextFile("data/singBad")
Update 1: My regex above was wrong, so I have updated it as shown below. In Scala the backslash is an escape character, so I doubled each one:
val singleReg = """\\w{3}\\s\\d+,\\s\\d{4}|\\d+\\s\\w{3},\\s\\d{4}|\\d+\/\\d+\/\\d{4}|\\d+-\\d+-\\d{4}""".r
Checked the regex on regex101 and the dates in the first 4 lines pass.
I have run the test again and I am still getting the same result.
Upvotes: 3
Views: 11188
Reputation: 6095
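There are 2 issues with the code:
1. The delimiter used to split each line of data.txt is wrong. It should be the character '|' instead of the string "|". split treats a String argument as a regular expression, and in a regex | is the alternation operator, so "|" matches the empty string and the line is split into single characters. A Char argument is taken literally.
2. The regex singleReg is wrong. In the original pattern, (3) and (4) are groups that match the literal digits 3 and 4, not the repetition counts {3} and {4}.
The correct code is as follows: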
Load and RDD
scala> val file = sc.textFile("test/data.txt")
scala> val fileRDD = file.map(x => x.split('|'))
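The difference between the two delimiters can be checked without Spark. The following is a minimal sketch that can be pasted into the Scala REPL or spark-shell; the names `byRegex`, `byChar` and `byEscaped` are illustrative only. Depending on the Java version, the regex-based result may also begin with an empty string, which would be consistent with the `(,C,h)` tuples in the question.

```scala
val line = "Christopher|Jan 11, 2017|5"

// String argument: interpreted as a regex. "|" is an alternation between two
// empty patterns, so it matches between every pair of characters.
val byRegex = line.split("|")

// Char argument: treated as a literal delimiter.
val byChar = line.split('|')

// Escaping the pipe inside a String argument gives the same fields.
val byEscaped = line.split("\\|")

println(byRegex.length)                       // far more than 3 pieces
println(byChar.mkString("[", "; ", "]"))      // [Christopher; Jan 11, 2017; 5]
```

With the Char delimiter, `x(1)` is the date field, so the filter compares the regex against the right value.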
RegEx
scala> val singleReg = """\w{3}\s\d{2},\s\d{4}|\d{2}\s\w{3},\s\d{4}|\d{1}\/\d{2}\/\d{4}|\d{2}-\d{2}-\d{4}""".r
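On the question about the triple quotes and `.r`: in a triple-quoted string the backslash is not treated as an escape character, so `\d` reaches the regex engine unchanged, and `.r` turns the string into a `scala.util.matching.Regex`. Doubling the backslashes is only needed in an ordinary double-quoted string. A small sketch, using just the dash-separated alternative of the pattern and illustrative names (`raw`, `escaped`, `doubledRaw`):

```scala
// Triple-quoted string: backslashes are passed through as written.
val raw = """\d{2}-\d{2}-\d{4}""".r

// Ordinary string: each backslash must be doubled to get the same pattern.
val escaped = "\\d{2}-\\d{2}-\\d{4}".r

// Doubling inside a triple-quoted string (as in Update 1) changes the
// pattern: \\ now matches a literal backslash and d matches the letter d.
val doubledRaw = """\\d{2}-\\d{2}-\\d{4}""".r

def matches(re: scala.util.matching.Regex, s: String): Boolean =
  re.pattern.matcher(s).matches

println(raw.regex == escaped.regex)          // true
println(matches(raw, "11-08-2017"))          // true
println(matches(doubledRaw, "11-08-2017"))   // false
```

This would explain why the updated regex passed on regex101 (which was presumably given single backslashes) but still matched nothing in the REPL.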
Filter
scala> val validSingleRecords = fileRDD.filter(x => (singleReg.pattern.matcher(x(1)).matches))
scala> val badSingleRecords = fileRDD.filter(x => !(singleReg.pattern.matcher(x(1)).matches))
Turn array into tuple
scala> val validSingle = validSingleRecords.map(x => (x(0),x(1),x(2)))
scala> val badSingle = badSingleRecords.map(x => (x(0),x(1),x(2)))
Write file
scala> validSingle.repartition(1).saveAsTextFile("data/singValid")
scala> badSingle.repartition(1).saveAsTextFile("data/singBad")
The above code gives the following output:
data/singValid
(Christopher,Jan 11, 2017,5 )
(Justin,11 Jan, 2017,5 )
(Thomas,6/17/2017,5 )
(John,11-08-2017,5 )
data/singBad
(Neli,2016,5 )
(Bilu,,5)
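To check the classification without a cluster, the same split and regex can be applied to a local collection. This is only a sketch: it uses `Seq.partition` in place of the two RDD filters, and the `r.length > 1` guard is an added assumption to avoid an index error on lines with no delimiter.

```scala
val singleReg =
  """\w{3}\s\d{2},\s\d{4}|\d{2}\s\w{3},\s\d{4}|\d{1}\/\d{2}\/\d{4}|\d{2}-\d{2}-\d{4}""".r

val lines = Seq(
  "Christopher|Jan 11, 2017|5",
  "Justin|11 Jan, 2017|5",
  "Thomas|6/17/2017|5",
  "John|11-08-2017|5",
  "Neli|2016|5",
  "Bilu||5"
)

// Split on the literal pipe character, then test the date field (index 1).
val rows = lines.map(_.split('|'))
val (valid, bad) = rows.partition { r =>
  r.length > 1 && singleReg.pattern.matcher(r(1)).matches
}

println(valid.map(_(0)).mkString(", "))   // Christopher, Justin, Thomas, John
println(bad.map(_(0)).mkString(", "))     // Neli, Bilu
```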
Upvotes: 5