Ged

Reputation: 18098

SPARK RDD Partition on a single HDFS Split

If we have a 128 MB file with an HDFS block (split) size of 128 MB, so the file fits in a single block, and we issue sc.textFile(xxx, 4), what actually happens? How is the resulting RDD partitioned in this case: into 4 processing partitions, or just 1?

Upvotes: 0

Views: 117

Answers (1)

Simon Schiff

Reputation: 711

When you use code like this:

// The second argument is minPartitions: a lower bound, not an exact count.
JavaRDD<String> in = sc.textFile(xxx, 4);
in.persist();

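Then your RDD has 4 partitions of roughly 32 MB each, provided the file is splittable (for example, uncompressed text). The second argument to textFile is a minimum number of partitions, so the input format is asked for at least 4 splits even though the file occupies a single HDFS block. A file in a non-splittable format such as gzip would still end up as 1 partition. Then you can do something like this: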

in.count();

When you then run your code locally with local[4], the count is executed as 4 tasks in parallel, one per partition. In local mode these tasks run as threads inside a single JVM rather than as separate processes.
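To see where the 32 MB figure comes from, here is a rough sketch of the arithmetic, modelled on how Hadoop's old-API FileInputFormat sizes its splits (goal size = file size / requested splits, capped by the block size, with a 10% slop on the last split). This is not Spark or Hadoop code: the class SplitEstimate and its methods are hypothetical names made up for illustration, and it assumes a single splittable file and a minimum split size of 1 byte.

```java
// Simplified model of split sizing; all names here are hypothetical.
class SplitEstimate {
    // The last split is allowed to be up to 10% larger than the others.
    static final double SPLIT_SLOP = 1.1;

    // Split size = max(minSplit, min(fileSize / requestedSplits, blockSize)).
    static long splitSize(long fileBytes, int minPartitions, long blockBytes, long minSplitBytes) {
        long goalSize = fileBytes / Math.max(minPartitions, 1);
        return Math.max(minSplitBytes, Math.min(goalSize, blockBytes));
    }

    // Counts how many splits (and so partitions) one splittable file yields.
    static int countSplits(long fileBytes, int minPartitions, long blockBytes) {
        long size = splitSize(fileBytes, minPartitions, blockBytes, 1L);
        int splits = 0;
        long remaining = fileBytes;
        while ((double) remaining / size > SPLIT_SLOP) {
            splits++;
            remaining -= size;
        }
        if (remaining > 0) {
            splits++; // whatever is left becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 128 MB file, 128 MB block, textFile(xxx, 4)
        System.out.println(countSplits(128 * mb, 4, 128 * mb)); // prints 4
        // Same file with a minimum of 1 partition
        System.out.println(countSplits(128 * mb, 1, 128 * mb)); // prints 1
    }
}
```

So under these assumptions the HDFS block size only caps the split size; asking for more partitions than there are blocks simply produces several splits per block.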

Upvotes: 1
