Spark Multidimensional RDD partitioning

Question

If I create two RDDs like these:

a = sc.parallelize([[1 for j in range(3)] for i in xrange(10**9)])

b = sc.parallelize([[1 for j in xrange(10**9)] for i in range(3)])

Partitioning the first one is intuitive: the billion rows are distributed across the workers. But the second one has only 3 rows, and each row contains a billion items.

My question is: for the second line (b), if I have 2 workers, does one row go to one worker and the other two rows go to the other worker?
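
For illustration, the expected split can be sketched in plain Python without a cluster. This is only a sketch under assumptions: it assumes that parallelize treats each top-level element (here, a whole row) as the unit it distributes and never splits an inner list, and that it cuts the collection into contiguous slices with the integer index arithmetic shown. The helper names slice_bounds and split_rows are hypothetical, not part of Spark, and the rows are shortened to 5 items so the example runs instantly.

```python
def slice_bounds(length, num_slices):
    """Hypothetical helper: contiguous (start, end) index pairs, one per slice.

    Assumption: slice i covers [i*length // n, (i+1)*length // n).
    """
    return [((i * length) // num_slices, ((i + 1) * length) // num_slices)
            for i in range(num_slices)]


def split_rows(rows, num_slices):
    """Hypothetical helper: cut a list of rows into slices; rows stay whole."""
    return [rows[start:end] for start, end in slice_bounds(len(rows), num_slices)]


# Small stand-in for b: 3 rows, each with 5 items instead of 10**9.
rows = [[1] * 5 for _ in range(3)]

parts = split_rows(rows, 2)
rows_per_partition = [len(p) for p in parts]
print(rows_per_partition)  # [1, 2]
```

Under these assumptions, one slice holds a single row and the other holds two, and each row is kept intact. On a real cluster the same thing could be checked with something like b.glom().map(len).collect(), which reports how many rows landed in each partition; the number of partitions depends on the numSlices argument or the default parallelism, not directly on the number of workers.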
