Reputation: 21635
I notice that if I apply a mapPartitions
on an RDD, the partitions get an iterable object. Within the mapPartitions
function, I then call the toArray
member function of the iterable to convert that iterable object to an Array object. Does calling toArray
involves copying, or does it just start referencing the same part of the memory as an Array? If it does involve copying, what are ways to prevent copying?
Upvotes: 4
Views: 956
Reputation: 3725
One important correction to your question -- the partition data structure exposed during mapPartitions
is an Iterator, not Iterable. Here's the interface difference:
Iterator
has the next()
and hasNext()
methods, which allow you to visit each element in the collection once. Once the next()
method of an iterator is called, the last element is gone (unless you've stored it in a variable).Iterable
has the ability to produce an Iterator
whenever you want. This lets you visit each element as many times as you want.In terms of implementation, an Iterator
can stream through data. You really only need to have one element in memory at a time, which is loaded when next()
is called. If you're reading from a text file with Spark (sc.textFile
) it does exactly this, and uses almost no memory to do simple iteration through partitions.
You're absolutely allowed to call iterator.toArray
, but you probably don't want to. You end up shoving all of the data into memory (Spark can't load just one element at a time, because you've asked for all of it at once), and either copy each piece of the data (for primitives, like Int
) or allocate a new reference for each piece of data (for AnyRef
, like Array[_]
). There is no way to prevent this copying.
There are times when converting a partition iterator to an array is what you want to do, but these use cases are rare. You risk running out of memory and slowing down your application a huge amount due to unnecessary allocation and GC, so think hard about whether it's really needed!
Upvotes: 3