Reputation: 51
I have a text file sherlock.txt containing multiple lines of text. I load it in spark-shell using:
val textFile = sc.textFile("sherlock.txt")
My goal is to count the number of words in the file. I have come across two alternative ways to do the job.
First using flatMap:
textFile.flatMap(line => line.split(" ")).count()
Second using map followed by reduce:
textFile.map(line => line.split(" ").size).reduce((a, b) => a + b)
Both correctly yield the same result. I want to know the differences in time and space complexity between these two implementations, if there are any.
Does the Scala interpreter convert both into the most efficient form?
Upvotes: 5
Views: 1631
Reputation: 330063
I will argue that the most idiomatic way to handle this would be to map and sum:
textFile.map(_.split(" ").size).sum
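Note that sum here comes from the implicit conversion to DoubleRDDFunctions, so it returns a Double; if you want a Long to match what count returns, convert explicitly:

textFile.map(_.split(" ").size).sum.toLong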
but at the end of the day the total cost will be dominated by line.split(" ").
You could probably do a little better by iterating over the string manually and counting whitespace runs instead of building a new Array, but I doubt it is worth the fuss in general.
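For illustration, a minimal sketch of that manual approach; countWords is a hypothetical helper, not part of Spark, and unlike split(" ").size it does not count the empty tokens produced by consecutive spaces:

// Count words in a line without allocating an Array, by counting
// transitions from space to non-space characters.
def countWords(line: String): Int = {
  var count = 0
  var inWord = false
  var i = 0
  while (i < line.length) {
    val isSpace = line.charAt(i) == ' '
    if (!isSpace && !inWord) count += 1 // transition into a new word
    inWord = !isSpace
    i += 1
  }
  count
}

textFile.map(countWords).sum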
If you prefer a little deeper insight, count is defined as:
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
where Utils.getIteratorSize is pretty much a naive iteration over the Iterator that sums ones, and sum is equivalent to:
_.fold(0.0)(_ + _)
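For reference, a sketch of what Utils.getIteratorSize boils down to (an illustration of the idea, not a verbatim copy of the Spark source):

// Consume the iterator, adding one per element: a plain linear traversal.
def getIteratorSize[T](iterator: Iterator[T]): Long = {
  var count = 0L
  while (iterator.hasNext) {
    iterator.next()
    count += 1L
  }
  count
}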
Upvotes: 6