Reputation: 3008
I was trying to find line with maximum words, and i wrote the following lines, to run on spark-shell:
import java.lang.Math
val counts = textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
But since, map is one to one , and flatMap
is one to either zero or anything. So i tried replacing map
with flatMap
, in above code. But its giving error as:
<console>:24: error: type mismatch;
found : Int
required: TraversableOnce[?]
val counts = F1.flatMap(s => s.split(" ").size).reduce((a,b)=> Math.max(a,b))
If anybody could make me understand the reason, it will really be helpful.
Upvotes: 1
Views: 1680
Reputation: 1220
After reading Tired of Null Pointer Exceptions? Consider Using Java SE 8's Optional!'s part about why use flatMap() rather than Map(), I have realized the truly reason why flatMap() can not replace map() is that map() is not a special case of flatMap().
It's true that flatMap() means one-to-many, but that's not the only thing flatMap() does. It can also strip outer Stream() if put it simply.
See the definations of map and flatMap:
Stream<R> map(Function<? super T, ? extends R> mapper)
Stream<R> flatMap(Function<? super T, ? extends Stream<? extends R>> mapper)
the only difference is the type of returned value in inner function. What map() returned is "Stream<'what inner function returned'>", while what flatMap() returned is just "what inner function returned".
So you can say that flatMap() can kick outer Stream() away, but map() can't. This is the most difference in my opinion, and also why map() is not just a special case of flatMap().
ps:
If you really want to make one-to-one with flatMap
, then you should change it into one-to-List(one). That means you should add an outer Stream() manually which will be stripped by flatMap() later. After that you'll get the same effect as using map().(Certainly, it's clumsy. So don't do like that.)
Here are examples for Java8, but the same as Scala:
use map()
:
list.stream().map(line -> line.split(" ").length)
deprecated use flatMap()
:
list.stream().flatMap(line -> Arrays.asList(line.split(" ").length).stream())
Upvotes: 0
Reputation: 1175
flatMap
must return an Iterable
which is clearly not what you want. You do want a map
because you want to map a line to the number of words, so you want a one-to-one function that takes a line and maps it to the number of words (though you could create a collection with one element, being the size of course...).
FlatMap
is meant to associate a collection to an input, for instance if you wanted to map a line to all its words you would do:
val words = textFile.flatMap(x => x.split(" "))
and that would return an RDD[String]
containing all the words.
In the end, map
transforms an RDD
of size N into another RDD
of size N (e.g. your lines to their length) whereas flatMap
transforms an RDD
of size N into an RDD
of size P (actually an RDD
of size N into an RDD
of size N made of collections, all these collections are then flattened to produce the RDD
of size P).
P.S.: one last word that has nothing to do with your problem, it is more efficient to do (for a string s
)
val nbWords = s.split(" ").length
than call .size()
. Indeed, the split
method returns an array of String
and arrays do not have a size
method. So when you call .size()
you have an implicit conversion from Array[String]
to SeqLike[String]
which creates new objects. But Array[T]
do have a length
field so there's no conversion calling length
. (It's a detail but I think it's good habit though).
Upvotes: 1
Reputation: 170859
Any use of map
can be replaced by flatMap
, but the function argument has to be changed to return a single-element List
: textFile.flatMap(line => List(line.split(" ").size))
. This isn't a good idea: it just makes your code less understandable and less efficient.
Upvotes: 0