Reputation: 92
If I create two rdds like these:
a = sc.parallelize([[1 for j in range(3)] for i in xrange(10**9)])
b = sc.parallelize([[1 for j in xrange(10**9)] for i in range(3)])
When you think about it partitioning first one is intuitive, billion rows are partitioned around workers. But for the second one there are 3 rows and for each row there are billion item.
My question is: For the second line, if I have 2 workers does one row goes to one worker, and the other two rows goes to the other worker?
Upvotes: 2
Views: 502
Reputation: 330063
Data distribution in Spark is limited to the top level sequence you use to create a RDD.
Depending on a configuration in the second case you'll get at most three non-empty partitions, each assigned to a single worker so in the second scenario 1-2 split is a likely outcome.
Generally speaking small number of elements, especially very large, doesn't fit well into Spark processing model.
Upvotes: 2