Sean Nguyen

Reputation: 13128

How to change hdfs block size in pyspark?

I use PySpark to write Parquet files. I would like to change the HDFS block size of those files. I set the block size like this, but it doesn't work:

sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")

Does this have to be set before starting the PySpark job? If so, how do I do it?

Upvotes: 4

Views: 7411

Answers (3)

Thomas Decaux

Reputation: 22671

You can set the block size of the files that Spark writes:

myDataFrame.write.option("parquet.block.size", 256 * 1024 * 1024).parquet(destinationPath)
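
For context, here is a minimal, self-contained sketch of that write; the SparkSession, DataFrame and destinationPath below are assumptions for illustration, and parquet.block.size is given in bytes:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("parquet-block-size-demo").getOrCreate()  # hypothetical session
myDataFrame = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])  # toy data
destinationPath = "hdfs:///tmp/parquet_block_size_demo"  # placeholder output path
# parquet.block.size sets the Parquet block (row group) size, in bytes
myDataFrame.write.option("parquet.block.size", 256 * 1024 * 1024).parquet(destinationPath)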

Upvotes: 0

genomics-geek

Reputation: 195

I had a similar issue and figured out the problem: the value needs to be a number of bytes, not "128m". So this should work (it worked for me, at least!):

block_size = str(1024 * 1024 * 128)
sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)
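
Applied to the Parquet write from the question, a short sketch (assuming an existing SparkSession spark, a DataFrame df to write, and a placeholder output path):

sc = spark.sparkContext  # assumed existing session
block_size = str(1024 * 1024 * 128)  # 128 MB expressed in bytes, as a string
sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)
df.write.parquet("hdfs:///tmp/output_parquet")  # placeholder path; files written with 128 MB HDFS blocks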

Upvotes: 0

mrsrinivas

Reputation: 35404

Try setting it through sc._jsc.hadoopConfiguration() on the SparkContext before writing the output:

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("yarn")
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set("dfs.block.size", "128m")
txt = sc.parallelize(("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/output/path")  # saving output with 128 MB block size

In Scala:

sc.hadoopConfiguration.set("dfs.block.size", "128m")
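
If the property really must be in place before the job starts (the second part of the question), one option worth sketching is Spark's spark.hadoop.* prefix, which copies such settings into the Hadoop configuration when the context is created; the master, value and output path below are assumptions:

from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setMaster("yarn")
        .set("spark.hadoop.dfs.block.size", str(1024 * 1024 * 128)))  # 128 MB in bytes
sc = SparkContext(conf=conf)
txt = sc.parallelize(("Hello", "world", "!"))
txt.saveAsTextFile("hdfs/output/path")  # output written with the configured block size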

Upvotes: 2
