TheWalkingData

Reputation: 1067

How to filter out alphanumeric strings in Scala using a regular expression

I want to filter out alphanumeric and numeric words from my file, keeping only the purely alphabetic ones. I'm working in spark-shell. These are the contents of my file sparktest.txt:

This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?

Loading the file as an RDD of lines:

scala> val myLines = sc.textFile("sparktest.txt")

Splitting each line into words and keeping only the words longer than 2 characters:

scala> val myWords = myLines.flatMap(x => x.split("\\W+")).filter(x => x.length >2)

Defining the regular expression to use. I only want strings that match "[A-Za-z]+":

scala> val regexpr = "[A-Za-z]+".r

Attempting to filter out the alphanumeric and numeric strings:

scala> val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)
<console>:27: error: scala.util.matching.Regex does not take parameters
       val myOnlyWords = myWords.map(x => x).filter(x => regexpr(x).matches)

This is where I'm stuck. I want the result to look like this:

Array[String] = Array(This, file, not, Would, you, this, file, HDFS)

Upvotes: 2

Views: 12638

Answers (2)

Rohan Aletty

Reputation: 2442

You can actually do this in one transformation and filter the split arrays within your flatMap:

val myWords = myLines.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))

When I run this in spark-shell, I see:

scala> val rdd1 = sc.parallelize(Array("This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[11] at parallelize at <console>:21

scala> val myWords = rdd1.flatMap(x => x.split("\\W+").filter(x => x.matches("[A-Za-z]+") && x.length > 2))
myWords: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at flatMap at <console>:23

scala> myWords.collect
...
res0: Array[String] = Array(This, file, not, Would, you, this, file, HDFS)
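
To check the predicate without a Spark context, the same split-and-filter logic can be tried on a plain String. This is only a minimal sketch: the object name FilterWordsLocal and the helper onlyWords are hypothetical names introduced for illustration and do not appear in the question.

```scala
// Local sketch (no Spark needed) of the split-and-filter step from the answer above.
// FilterWordsLocal and onlyWords are hypothetical names used only for illustration.
object FilterWordsLocal {
  // Split on runs of non-word characters, then keep tokens that are
  // purely alphabetic and longer than 2 characters.
  def onlyWords(line: String): Array[String] =
    line.split("\\W+").filter(w => w.matches("[A-Za-z]+") && w.length > 2)

  def main(args: Array[String]): Unit = {
    val line = "This is 1 file not 54783. Would you l1ke this file to be Writt3n to HDFS?"
    // prints: This, file, not, Would, you, this, file, HDFS
    println(onlyWords(line).mkString(", "))
  }
}
```

Because the function passed to flatMap runs on each line independently, the same helper could be passed as rdd1.flatMap(FilterWordsLocal.onlyWords) in spark-shell, assuming the object is defined in that session.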

Upvotes: 4

Alexandr Dorokhin

Reputation: 850

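A scala.util.matching.Regex cannot be called like a function, which is why regexpr(x) does not compile. You can use filter(x => regexpr.pattern.matcher(x).matches) or filter(_.matches("[A-Za-z]+")) instead.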
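
As a rough sketch, both filters can be compared on a local Seq[String] instead of an RDD. The object name RegexFilterSketch, the two method names, and the sample word list are assumptions made for illustration; only regexpr and the two filter expressions come from the question and this answer.

```scala
import scala.util.matching.Regex

// Hypothetical wrapper object; only regexpr and the filter bodies come from the thread.
object RegexFilterSketch {
  val regexpr: Regex = "[A-Za-z]+".r

  // Option 1: reuse the compiled java.util.regex.Pattern behind the Regex.
  // Matcher.matches succeeds only if the whole string matches the pattern.
  def viaMatcher(words: Seq[String]): Seq[String] =
    words.filter(x => regexpr.pattern.matcher(x).matches)

  // Option 2: String.matches, which also requires a whole-string match
  // but compiles the pattern again on every call.
  def viaStringMatches(words: Seq[String]): Seq[String] =
    words.filter(_.matches("[A-Za-z]+"))

  def main(args: Array[String]): Unit = {
    val words = Seq("This", "file", "54783", "l1ke", "Writt3n", "HDFS")
    println(viaMatcher(words))        // List(This, file, HDFS)
    println(viaStringMatches(words))  // List(This, file, HDFS)
  }
}
```

Both options give the same result here. The first avoids recompiling the pattern for every word, which may matter on large inputs.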

Upvotes: 1
