Reputation: 1067
I want to filter out alphanumeric and numeric words from my file. I'm working in spark-shell. These are the contents of my file sparktest.txt:
This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?
Defining the file for collection:
scala> val myLines = sc.textFile("sparktest.txt")
Splitting the lines into words and keeping only those longer than 2 characters:
scala> val myWords = myLines.flatMap(x => x.split("\\W+")).filter(x => x.length >2)
Defining a regular expression to use. I only want strings that match "[A-Za-z]+":
scala> val regexpr = "[A-Za-z]+".r
Attempting to filter out the alphanumeric and numeric strings:
scala> val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)
<console>:27: error: scala.util.matching.Regex does not take parameters
val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)
This is where I'm stuck. I want the result to look like this:
Array[String] = Array(This, file, not, Would, you, this, file, HDFS)
Upvotes: 2
Views: 12638
Reputation: 2442
You can actually do this in one transformation and filter the split arrays within your flatMap:
val myWords = myLines.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))
When I run this in spark-shell, I see:
scala> val rdd1 = sc.parallelize(Array("This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[11] at parallelize at <console>:21
scala> val myWords = rdd1.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))
myWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at flatMap at <console>:23
scala> myWords.collect
...
res0: Array[String] = Array(This, file, not, Would, you, this, file, HDFS)
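The same split-and-filter logic can be checked in plain Scala, without Spark, since the transformation is applied per line anyway (a minimal sketch on the sample sentence from the question):

```scala
object SplitFilterCheck {
  def main(args: Array[String]): Unit = {
    val line = "This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?"
    // Split on non-word characters, then keep purely alphabetic words longer than 2 chars
    val words = line.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2)
    println(words.mkString(", "))
    // This, file, not, Would, you, this, file, HDFS
  }
}
```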
Upvotes: 4
Reputation: 850
The error occurs because scala.util.matching.Regex has no apply method, so regexpr(x) is not valid. You can use filter(x => regexpr.pattern.matcher(x).matches)
or simply filter(_.matches("[A-Za-z]+"))
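Either variant behaves the same on the question's words; a minimal plain-Scala sketch (the sample array here is illustrative, not from the original RDD):

```scala
object RegexFilterCheck {
  def main(args: Array[String]): Unit = {
    val regexpr = "[A-Za-z]+".r
    val words = Array("This", "file", "54783", "l1ke", "HDFS")
    // regexpr.pattern is the underlying java.util.regex.Pattern;
    // matcher(x).matches requires the whole string to match
    val kept = words.filter(x => regexpr.pattern.matcher(x).matches)
    println(kept.mkString(", "))
    // This, file, HDFS
  }
}
```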
Upvotes: 1