greatji
greatji

Reputation: 87

What's the difference between collect and count actions?

When I was coding scala program in local-mode spark, the code is similar to RDD.map(x => function()).collect, there is no log output in the console for a long time and I guess it stucks. So I change the action collect into count, the whole execution completed soon. Plus, there is little records produced by the map phase to be collected by collect, so problem can't be caused by the network trans when sending back the result to the driver.

Can anyone know the reason or has run into the similar problem?

Upvotes: 2

Views: 6183

Answers (1)

Alberto Bonsanto
Alberto Bonsanto

Reputation: 18042

The method count sums the number of entries of the RDD for every partition, and it returns an integer consisting in that number, hence the data transfer is minimal; in the other hand, the method collect as its name says brings all the entries to the driver as a List therefore if there isn't enough memory you may get several exceptions (this is why it's not recommended to do a collect if you aren't sure that it will fit in your driver, there are normally faster alternatives like take, which also triggers lazy transformations), besides it requires to transfer much more data.

Upvotes: 8

Related Questions