pythonic
pythonic

Reputation: 21635

Relationship between iterable and Array in Spark

I notice that if I apply a mapPartitions on an RDD, the partitions get an iterable object. Within the mapPartitions function, I then call the toArray member function of the iterable to convert that iterable object to an Array object. Does calling toArray involves copying, or does it just start referencing the same part of the memory as an Array? If it does involve copying, what are ways to prevent copying?

Upvotes: 4

Views: 956

Answers (1)

Tim
Tim

Reputation: 3725

One important correction to your question -- the partition data structure exposed during mapPartitions is an Iterator, not Iterable. Here's the interface difference:

  • An Iterator has the next() and hasNext() methods, which allow you to visit each element in the collection once. Once the next() method of an iterator is called, the last element is gone (unless you've stored it in a variable).
  • An Iterable has the ability to produce an Iterator whenever you want. This lets you visit each element as many times as you want.

In terms of implementation, an Iterator can stream through data. You really only need to have one element in memory at a time, which is loaded when next() is called. If you're reading from a text file with Spark (sc.textFile) it does exactly this, and uses almost no memory to do simple iteration through partitions.

You're absolutely allowed to call iterator.toArray, but you probably don't want to. You end up shoving all of the data into memory (Spark can't load just one element at a time, because you've asked for all of it at once), and either copy each piece of the data (for primitives, like Int) or allocate a new reference for each piece of data (for AnyRef, like Array[_]). There is no way to prevent this copying.

There are times when converting a partition iterator to an array is what you want to do, but these use cases are rare. You risk running out of memory and slowing down your application a huge amount due to unnecessary allocation and GC, so think hard about whether it's really needed!

Upvotes: 3

Related Questions