leon

Reputation: 10395

How to dynamically change an existing file's block size in Hadoop?

I have a Hadoop cluster running, and I use the Hadoop API to create files in it, for example with: create(Path f, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress).
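For reference, here is roughly how I create the files (a minimal sketch; the path, block size, replication, and data are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The block size is fixed at create time; 64 MB here as an example (value in bytes).
        Path file = new Path("/user/leon/data.bin");   // placeholder path
        FSDataOutputStream out = fs.create(
                file,
                true,                  // overwrite
                4096,                  // bufferSize
                (short) 3,             // replication
                64L * 1024 * 1024,     // blockSize
                null);                 // Progressable (none)
        out.writeBytes("some data");   // placeholder content
        out.close();
        fs.close();
    }
}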

I am wondering how I can change the block size of a file after it has been created. Is there a command, an API, or any other method for this? I can't find a function for changing the block size in the API.

Thanks

Upvotes: 2

Views: 2685

Answers (3)

Sai

Reputation: 11

Try this:

hdfs dfs -D dfs.blocksize=[your block size] -put [your file/dir name] [dest file/dir]
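For example, to upload a file with a 128 MB block size (the value is in bytes, and the paths here are just placeholders):

hdfs dfs -D dfs.blocksize=134217728 -put /local/path/bigfile /user/leon/bigfile

Note that this only affects the newly written copy; it does not change the block size of a file already sitting in HDFS.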

Thank you, Sai

Upvotes: 1

Praveen Sripati

Reputation: 33545

I am not sure whether the block size can be changed dynamically once the file has been written to HDFS. One workaround is to get the file out of HDFS and put it back again with the required block size. See the email from Allen on how to do it.
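Something like the following should do that round trip with the shell (the paths and block size are placeholders; dfs.blocksize is given in bytes):

hdfs dfs -get /user/leon/bigfile /tmp/bigfile
hdfs dfs -rm /user/leon/bigfile
hdfs dfs -D dfs.blocksize=268435456 -put /tmp/bigfile /user/leon/bigfile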

Upvotes: 3

QuinnG

Reputation: 6424

I do not know of, and did not find, a way to dynamically change the block size of a single file using an API. There are, however, multiple ways to change the block size of a file stored on HDFS.

Aside from using the create function and specifying a different block size, they center around changing the default block size that HDFS uses when storing the file.

The two most basic ways to make use of the changed default block size:

  • Copy the file locally; delete the HDFS file; upload the local copy with the new block size
  • Copy the file to a new location/name on HDFS; delete the initial file; move/rename the copy back to the original location/name (see the sketch after this list)
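A sketch of the second approach with the shell (the paths and block size are placeholders, and this assumes -cp honors the -D dfs.blocksize override the same way -put does):

hdfs dfs -D dfs.blocksize=268435456 -cp /user/leon/bigfile /user/leon/bigfile.tmp
hdfs dfs -rm /user/leon/bigfile
hdfs dfs -mv /user/leon/bigfile.tmp /user/leon/bigfile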

The same idea can be done using the API: copy the file to the local drive, delete the HDFS file, then use the API to re-create the file from the local copy with the desired block size.
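A rough sketch of that with the FileSystem API (the paths, block size, replication, and buffer size are placeholders, not a hardened implementation):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReblockFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path hdfsFile = new Path("/user/leon/bigfile");   // placeholder
        String localCopy = "/tmp/bigfile";                // placeholder

        // 1. Copy the file to the local drive.
        fs.copyToLocalFile(hdfsFile, new Path(localCopy));

        // 2. Delete the HDFS file.
        fs.delete(hdfsFile, false);

        // 3. Re-create it with the desired block size and stream the local copy back in.
        long newBlockSize = 256L * 1024 * 1024;           // 256 MB, in bytes
        FSDataOutputStream out = fs.create(hdfsFile, true, 4096, (short) 3, newBlockSize, null);
        InputStream in = new BufferedInputStream(new FileInputStream(localCopy));
        IOUtils.copyBytes(in, out, conf, true);           // closes both streams

        fs.close();
    }
}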

I can surmise why this hasn't been implemented yet: while it would simplify the task, it's probably not needed very often. To implement it, the file would need to be 're-assembled' and then re-blocked according to the new size. For a very large file, this could saturate the network, as all of the data could potentially travel across the network multiple times.

I don't know Hadoop's internals well enough to say exactly what shortfalls there might be in implementing this functionality in the API, but I can see a few points of contention that may stall an implementation while bigger needs are addressed.

hth

Upvotes: 3
