Reputation: 799
I am a Python programmer. Because the Python API is too slow for my Spark application, I decided to port my code to the Spark Scala API to compare computation times.
I am trying to keep only the lines that start with a numeric character from a huge file, using the Scala API in Spark. In my file, some lines contain numbers and some contain words, and I want only the lines with numbers.
So, in my Python application, I have these lines.
l = sc.textFile("my_file_path")
l_filtered = l.filter(lambda s: s[0].isdigit())
which works exactly as I want.
This is what I have tried so far.
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.forall(_.isDigit))
This throws an error saying that char does not have a forall() function.
I also tried taking the first character of each line with x.take(1) and applying isDigit() to it, as follows.
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).isDigit)
and this too...
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).Character.isDigit)
This also throws an error.
This is probably a small mistake, but as I am not accustomed to Scala syntax, I am having a hard time figuring it out. Any help would be appreciated.
Edit: Following the answer to this question, I tried writing the function, but I am unable to use it with filter() in my application to apply it to every line in the file.
Upvotes: 1
Views: 3407
Reputation: 6260
In Scala, indexing syntax uses parentheses () instead of brackets []. The exact translation of your Python code would be this:
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_(0).isDigit)
A more idiomatic way to extract the first symbol would be to use the head method:
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.head.isDigit)
Both of these methods would fail if your file contains empty lines.
If that's the case, then you probably want this:
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.map(_.isDigit).getOrElse(false))
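To see why the empty-line case matters, here is a minimal sketch on a plain Scala List rather than an RDD. The object name FirstCharDemo and the sample strings are made up for illustration, and scala.util.Try is used only to capture the exception instead of crashing.

```scala
import scala.util.Try

object FirstCharDemo {
  // Hypothetical sample lines, including an empty one.
  val sample: List[String] = List("1hello", "world", "")

  // head and (0) both throw on the empty string; Try turns that into None.
  val viaHead: List[Option[Boolean]]  = sample.map(s => Try(s.head.isDigit).toOption)
  val viaIndex: List[Option[Boolean]] = sample.map(s => Try(s(0).isDigit).toOption)

  // headOption yields None for the empty string, so no exception is raised
  // and the empty line is simply treated as "does not start with a digit".
  val viaHeadOption: List[Boolean] =
    sample.map(s => s.headOption.map(_.isDigit).getOrElse(false))
}
```

In this sketch the first two lists end with None for the empty line, while the headOption version ends with false.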
UPD. As curious noted, map(predicate).getOrElse(false) on Option could be shortened to exists(predicate):
val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.exists(_.isDigit))
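Regarding the edit in the question about using a function inside filter(): one possible way is to define a named predicate and pass it directly. This is only a sketch under assumptions; LineFilters, startsWithDigit and startsWithDigitLong are hypothetical names, and the example runs on a plain List so it can be tried without a Spark context.

```scala
object LineFilters {
  // Hypothetical helper: true when the first character of the line is a digit.
  // An empty line has no first character, so the result is false.
  def startsWithDigit(line: String): Boolean =
    line.headOption.exists(_.isDigit)

  // Longer spelling of the same check, kept to show that the two forms agree.
  def startsWithDigitLong(line: String): Boolean =
    line.headOption.map(_.isDigit).getOrElse(false)

  // Hypothetical sample lines; the method is passed to filter by name.
  val sample: List[String] = List("1hello", "2world", "good", "")
  val kept: List[String] = sample.filter(startsWithDigit)
}
```

With the RDD from the question, the call would presumably look like l.filter(LineFilters.startsWithDigit), since filter on an RDD of strings also takes a String => Boolean function.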
Upvotes: 6
Reputation: 2928
You can use regular expressions:
scala> List("1hello","2world","good").filter(_.matches("^[0-9].*$"))
res0: List[String] = List(1hello, 2world)
Or, since the file might contain a huge number of lines to filter, you can do it with fewer operations by checking only the first character:
scala> List("1hello","world").filter(_.headOption.exists(_.isDigit))
res1: List[String] = List(1hello)
Replace the List[String] with your lines l for this to work in your case.
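Since the question mentions both lines that start with a number and lines that only have numbers, the following sketch puts the two checks side by side, together with the regex variant. DigitChecks and the sample strings are hypothetical, and a plain List stands in for the RDD.

```scala
object DigitChecks {
  // Hypothetical sample lines.
  val sample: List[String] = List("123", "1hello", "world", "")

  // Lines whose first character is a digit.
  val startsWithDigit: List[String] =
    sample.filter(_.headOption.exists(_.isDigit))

  // Lines made up entirely of digits. nonEmpty is checked explicitly
  // because forall returns true for the empty string.
  val onlyDigits: List[String] =
    sample.filter(s => s.nonEmpty && s.forall(_.isDigit))

  // String.matches tests the whole string, so the ^ and $ anchors
  // from the regex above can be left out without changing the result.
  val viaRegex: List[String] =
    sample.filter(_.matches("[0-9].*"))
}
```

One difference to keep in mind: isDigit also accepts non-ASCII Unicode digits, while [0-9] matches ASCII digits only, so the two filters can disagree on such input.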
Upvotes: 2