Georg Heiler

Reputation: 17724

Files larger than block size in HDFS

It is common knowledge that writing a single file larger than the HDFS block size is not optimal; the same goes for many very small files.

However, performing a repartition('myColumn) operation in Spark creates a single partition per value (let's assume one per day) that contains all of that day's records as a single file, which might be several GB in size (assume 20 GB), whereas the HDFS block size is configured to be 256 MB.

Is it actually bad that the file is too large? When reading the file back in (assuming it is a splittable format like Parquet or ORC with gzip or zlib compression), Spark creates >> 1 task per file. Does this mean I do not need to worry about specifying maxRecordsPerFile or about the file size exceeding the HDFS block size?
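For context, a minimal sketch of the write I'm describing (the input/output paths and the record cap value are hypothetical; maxRecordsPerFile is the standard Spark 2.2+ writer option):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("repartition-example").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("/data/events")        // hypothetical input path

    df.repartition('myColumn)                          // one shuffle partition per distinct myColumn value
      .write
      .partitionBy("myColumn")
      .option("maxRecordsPerFile", 5000000)            // optional cap that splits each partition into smaller files
      .parquet("/data/events_by_day")                  // hypothetical output path

Without the maxRecordsPerFile option, each day's partition is written as one file, hence the ~20 GB files.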

Upvotes: 3

Views: 1766

Answers (1)

OneCricketeer

Reputation: 191973

Having a single large file in a splittable format is a good thing in HDFS. The namenode has to maintain fewer file references, and there are more blocks to parallelize processing across.

In fact, 20 GB still isn't large in Hadoop terms, considering it would fit on a cheap flash drive.
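As a rough illustration of the parallelism when reading back (a sketch, assuming the hypothetical output path from the question; Spark splits Parquet roughly by spark.sql.files.maxPartitionBytes, 128 MB by default):

    val back = spark.read.parquet("/data/events_by_day")
    println(back.rdd.getNumPartitions)   // >> 1 even though each day was written as a single large file

So the large file is still read by many tasks, not one.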

Upvotes: 7
