Reputation: 6766
I understand that partitionBy
function partitions my data. If I use rdd.partitionBy(100)
it will partition my data by key into 100 parts. i.e. data associated with similar keys will be grouped together
Upvotes: 11
Views: 23975
Reputation: 330093
There is no simple answer here. All depends on amount of data and available resources. Too large or too low number of partitions will degrade the performance.
Some resources claim the number of partitions should around twice as large as the number of available cores. From the other hand a single partition typically shouldn't contain more than 128MB and a single shuffle block cannot be larger than 2GB (See SPARK-6235).
Finally you have to correct for potential data skews. If some keys are overrepresented in your dataset it can result in suboptimal resource usage and potential failure.
No, or at least not directly. You can use keyBy
method to convert RDD to required format. Moreover any Python object can be treated as a key-value pair as long as it implements required methods which make it behave like an Iterable
of length equal two. See How to determine if object is a valid key-value pair in PySpark
tuple
of integers is.To quote Python glossary:
An object is hashable if it has a hash value which never changes during its lifetime (it needs a
__hash__()
method), and can be compared to other objects (it needs an__eq__()
method). Hashable objects which compare equal must have the same hash value.
Upvotes: 15
Reputation: 37
I recently used partitionby. What I did was restructure my data so that all those which I want to put in same partition have the same key, which in turn is a value from the data. my data was a list of dictionary, which I converted into tupples with key from dictionary.Initially the partitionby was not keeping same keys in same partition. But then I realized the keys were strings. I cast them to int. But the problem persisted. The numbers were very large. I then mapped these numbers to small numeric values and it worked. So my take away was that the keys need to be small integers.
Upvotes: -1