Shiva

Reputation: 799

Finding lines that start with a digit in Scala using filter() method

I am a Python programmer. Because the Python API is too slow for my Spark application, I decided to port my code to the Spark Scala API to compare the computation time.

I am trying to keep only the lines that start with a numeric character from a huge file, using the Scala API in Spark. In my file, some lines start with numbers and some with words, and I want only the lines that start with a number.

So, in my Python application, I have these lines.

l = sc.textFile("my_file_path")
l_filtered = l.filter(lambda s: s[0].isdigit())

which works exactly as I want.

This is what I have tried so far.

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.forall(_.isDigit))

This throws an error saying that Char does not have a forall() function.

I also tried taking the first character of each line using x.take(1) and applying isDigit to it, in the following way.

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).isDigit)

and this too...

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(x => x.take(1).Character.isDigit)

This also throws an error.

This is probably a small error, but as I am not accustomed to Scala syntax, I am having a hard time figuring it out. Any help would be appreciated.

Edit: Following the answer to this question, I tried writing the function, but I am unable to use it in filter() in my application so that it is applied to all the lines in the file.

Upvotes: 1

Views: 3407

Answers (2)

Ihor Kaharlichenko

Reputation: 6260

In Scala, indexing uses parentheses () instead of brackets []. The exact translation of your Python code would be this:

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_(0).isDigit)

A more idiomatic way to extract the first character is the head method:

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.head.isDigit)

Both of these approaches throw an exception if your file contains empty lines, because an empty string has no first character.
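To see the empty-line failure without a Spark cluster, here is a minimal local sketch. The object and method names (EmptyLineDemo, byIndex, byHead) are hypothetical and only mirror the two filters above; scala.util.Try is used to capture the exception instead of crashing.

```scala
import scala.util.Try

object EmptyLineDemo {
  // Hypothetical helpers that mirror the two filter bodies shown above.
  def byIndex(s: String): Boolean = s(0).isDigit   // same as _(0).isDigit
  def byHead(s: String): Boolean = s.head.isDigit  // same as _.head.isDigit

  def main(args: Array[String]): Unit = {
    // Both calls throw on an empty string; Try turns the exception into a Failure.
    println(Try(byIndex("")).isFailure)
    println(Try(byHead("")).isFailure)
    // Non-empty input behaves as expected.
    println(byIndex("1hello"))
    println(byHead("world"))
  }
}
```

Inside a Spark job, the same exception would be raised on an executor and would fail the task, so it is worth ruling out empty lines before relying on either form.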

If that's the case, then you probably want this:

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.map(_.isDigit).getOrElse(false))

UPD.

As curious noted, map(predicate).getOrElse(false) on an Option can be shortened to exists(predicate):

val l = sc.textFile("my_file_path")
val l_filtered = l.filter(_.headOption.exists(_.isDigit))
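Regarding the edit in the question about using a separately written function inside filter(): a named predicate can be passed directly. The sketch below is a local illustration under assumptions; DigitFilter and startsWithDigit are hypothetical names, and a plain List stands in for the RDD so that it runs without Spark.

```scala
object DigitFilter {
  // Hypothetical helper: true when the line is non-empty
  // and its first character is a digit.
  def startsWithDigit(line: String): Boolean =
    line.headOption.exists(_.isDigit)

  def main(args: Array[String]): Unit = {
    // A local List stands in for the RDD of lines here.
    val lines = List("1hello", "", "world", "42")
    // The method is passed by name; Scala converts it to a String => Boolean function.
    val kept = lines.filter(startsWithDigit)
    println(kept) // List(1hello, 42)
  }
}
```

With the RDD from the question, the call should look the same, for example l.filter(DigitFilter.startsWithDigit), since filter on an RDD of strings also takes a String => Boolean function.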

Upvotes: 6

curious

Reputation: 2928

You can use regular expressions:

scala> List("1hello","2world","good").filter(_.matches("^[0-9].*$"))
res0: List[String] = List(1hello, 2world)

Or you can do it like this, with fewer operations, since the file might contain a huge number of lines to filter:

scala> List("1hello","world").filter(_.headOption.exists(_.isDigit))
res1: List[String] = List(1hello)

In your case, replace the List[String] with your RDD of lines, l.
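One caveat when choosing between the two: they do not agree on every input. The character class [0-9] covers only ASCII digits, while isDigit delegates to java.lang.Character.isDigit, which also accepts other Unicode decimal digits. A small sketch with hypothetical names (DigitDefinitions, byRegex, byIsDigit), using an Arabic-Indic digit as the sample input:

```scala
object DigitDefinitions {
  // Hypothetical helpers wrapping the two approaches shown above.
  // matches() already requires a full-string match, so ^ and $ are not needed.
  def byRegex(s: String): Boolean = s.matches("[0-9].*")
  def byIsDigit(s: String): Boolean = s.headOption.exists(_.isDigit)

  // Starts with U+0663 (ARABIC-INDIC DIGIT THREE), a non-ASCII decimal digit.
  val sample: String = "\u0663abc"

  def main(args: Array[String]): Unit = {
    println(byRegex("1hello"))   // true
    println(byIsDigit("1hello")) // true
    println(byRegex(sample))     // false: not in the ASCII range 0-9
    println(byIsDigit(sample))   // true: Character.isDigit accepts it
  }
}
```

If the file is known to contain only ASCII text, the two behave the same and the choice comes down to readability and speed.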

Upvotes: 2
