MPAK

Reputation: 39

How to change the number of partitions of an RDD with a large local file (non-HDFS file)?

I have an 8.9 GB text file from which I create an RDD in Spark:

textfile = sc.textFile("input.txt")

The number of partitions that Spark creates is 279, which it obtains by dividing the size of the input file by the 32 MB default HDFS block size. I can pass an argument to textFile and ask for more partitions; unfortunately, however, I cannot get fewer partitions than this default (e.g., 4).

If I pass 4 as an argument, Spark ignores it and proceeds with 279 partitions.

Since my underlying filesystem is not HDFS, it seems very inefficient to me to split the input into this many partitions. How can I force Spark to use fewer partitions? How can I set the default HDFS block size that Spark assumes to a larger value?
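For illustration, a minimal PySpark sketch of the behaviour I am describing (assuming a plain local SparkContext; the second argument to textFile is only a minimum number of partitions, which is why 4 is ignored):

from pyspark import SparkContext

sc = SparkContext(appName="partition-test")

# minPartitions is a lower bound only: Spark may create more partitions
# than requested, but never fewer than the input format's split count.
textfile = sc.textFile("input.txt", 4)
print(textfile.getNumPartitions())  # still 279 for my 8.9 GB file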

Upvotes: 1

Views: 3049

Answers (3)

Sandeep Patil

Reputation: 21

I've also tried most of these configurations; in the end, what worked for me was repartition():

textfile = sc.textFile("input.txt").repartition(2)
textfile.getNumPartitions()
# result
2
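Note that repartition() performs a full shuffle; if you only need to reduce the partition count, coalesce() should give the same result without the shuffle (a sketch, not benchmarked on a file this size):

textfile = sc.textFile("input.txt").coalesce(2)
textfile.getNumPartitions()
# result
2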

Upvotes: 0

Prajwol Sangat

Reputation: 77

I have encountered the same issue. I tried changing the following settings:

conf.set("spark.hadoop.dfs.block.size", str(min_block_size))
conf.set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", str(min_block_size))
conf.set("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize", str(max_block_size))

None of them actually changed the input split size; it remained 32 MB. I then realised that I am using a local file system and not HDFS, which is probably why these settings have no effect. I found another configuration that is supposed to work with local files (I think), shown below.

# The maximum number of bytes to pack into a single partition when reading files.
conf.set("spark.files.maxPartitionBytes", str(min_block_size)) 

However, it had no effect at all. I tried one more configuration change, adding the following:

conf.set("spark.sql.files.maxPartitionBytes", str(sql_block_size))

It changed the input partition size for DataFrames, but not for the RDD. :(

If anyone has found a configuration that actually changes the input partition size for an RDD, I would appreciate an answer.
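Based on that last observation, one possible workaround (a sketch only, I have not measured it on a file this size) is to read the text file through the DataFrame reader, which does honour spark.sql.files.maxPartitionBytes, and then drop down to an RDD; the 512 MB value below is just an example:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("fewer-partitions")
         .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))  # example: 512 MB per partition
         .getOrCreate())

df = spark.read.text("input.txt")        # partition size follows spark.sql.files.maxPartitionBytes
rdd = df.rdd.map(lambda row: row.value)  # back to an RDD of lines, keeping the coarser partitioning
print(rdd.getNumPartitions())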

Upvotes: 0

Metadata

Reputation: 2083

In your case, since the block size is 32 MB, you are getting 279 partitions. You can increase the block size in your HDFS to any other suitable value that fits your requirement. You can find the block size parameter in hdfs-site.xml.
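For reference, the parameter is dfs.blocksize (dfs.block.size in older Hadoop releases); a sketch of the hdfs-site.xml entry, with 128 MB as an example value:

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB, example value -->
</property>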

Upvotes: 0
