Reputation: 87
I am writing a Scala program in local-mode Spark. The code is similar to RDD.map(x => function()).collect. It produces no log output in the console for a long time, so I guess it is stuck. When I change the action from collect to count, the whole execution completes quickly. Also, the map phase produces only a few records for collect to gather, so the problem can't be caused by network transfer when sending the result back to the driver.
Does anyone know the reason, or has anyone run into a similar problem?
Upvotes: 2
Views: 6183
Reputation: 18042
The count method counts the entries of the RDD in each partition and sums the results. It returns a single number (a Long), so the data transfer is minimal. The collect method, on the other hand, does what its name says: it brings all the entries to the driver as an Array. If the driver doesn't have enough memory you may get out-of-memory errors, which is why collect is not recommended unless you are sure the result will fit in the driver. It also requires transferring much more data. There are normally faster alternatives such as take, which also triggers the lazy transformations.
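To make the difference concrete, here is a minimal sketch that runs the three actions on the same mapped RDD. It assumes spark-core is on the classpath and a local master. The object name CountVsCollect, the transform function (a stand-in for the function() in the question) and the sample data are all hypothetical, not taken from the asker's code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CountVsCollect {
  // Hypothetical stand-in for the function() used in the question.
  def transform(x: Int): Int = x * 2

  // Runs count, take and collect on the same mapped RDD and returns the results.
  def run(sc: SparkContext): (Long, Array[Int], Array[Int]) = {
    // Illustrative data: 1000 integers spread over 4 partitions.
    val mapped = sc.parallelize(1 to 1000, numSlices = 4).map(transform)

    // count: each partition is counted where it lives; only the
    // per-partition counts travel back to the driver.
    val total: Long = mapped.count()

    // take(n): returns the first n elements and evaluates only as many
    // partitions as it needs to find them.
    val firstFive: Array[Int] = mapped.take(5)

    // collect: every element of every partition is sent to the driver
    // and materialized there as one Array.
    val everything: Array[Int] = mapped.collect()

    (total, firstFive, everything)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("count-vs-collect")
    val sc = new SparkContext(conf)
    try {
      val (total, firstFive, everything) = run(sc)
      println(s"count = $total")
      println(s"take(5) = ${firstFive.mkString(", ")}")
      println(s"collect size = ${everything.length}")
    } finally {
      sc.stop()
    }
  }
}
```

With data this small all three actions finish quickly. The point is only what each one sends to the driver: a number, a handful of elements, or the whole dataset.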
Upvotes: 8