Muthukumar

Reputation: 11

Hadoop as Data Archive System

I am analyzing the possibility of using Hadoop (HDFS) as a data archival solution, since it offers linear scalability and a lower maintenance cost per terabyte.

Please let me know your recommendations, and the set of parameters (I/O, memory, disk) that should be analyzed to evaluate Hadoop as a data archival system.

On a related query: I am trying to upload a 500 MB file using the Hadoop shell as follows.

$ # We have a 500 MB file created using dd
$ dd if=/dev/zero of=500MBFile.txt bs=524288000 count=1
$ hadoop fs -Ddfs.block.size=67108864 -copyFromLocal 500MBFile.txt /user/cloudera/

Please let me know why the input file does not appear to get split based on the block size (64 MB). This will be good to understand since, as part of data archival, if we receive a 1 TB file, how will it be split and distributed across the cluster?
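
If my understanding is correct, at a 64 MB block size a 1 TB file should come out to roughly 1 TB / 64 MB = 16,384 blocks, so I would like to confirm that HDFS handles splitting and placement at that scale transparently.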

I've tried the exercise using a single-node Cloudera Hadoop setup, and the replication factor is 1.

Thanks again for your great response.

Upvotes: 1

Views: 2182

Answers (3)

BlueFish

Reputation: 11

You can load the files into HDFS in .har (Hadoop Archive) format.

You can get more details here: Hadoop Archives
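
A minimal sketch of creating and reading back an archive (the paths and archive name here are just illustrative):

$ # Pack files under /user/cloudera into a single .har stored in /user/cloudera/archives
$ hadoop archive -archiveName data.har -p /user/cloudera 500MBFile.txt /user/cloudera/archives
$ # List the archived contents through the har:// filesystem
$ hadoop fs -ls har:///user/cloudera/archives/data.har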

Upvotes: 1

Syntharz Tech Team

Reputation: 103

A few inputs:

  1. Consider compression in your solution. It looks like you will be using text files, so you can achieve around 80% compression.
  2. Make sure you select a Hadoop-friendly (i.e. splittable) compression codec; see the sketch below.
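
For example, bzip2 is generally splittable in Hadoop while plain gzip is not, so something along these lines could work (file names are illustrative):

$ # Compress the text archive with a splittable codec before loading it
$ bzip2 -9 archive-2012-01.txt
$ hadoop fs -copyFromLocal archive-2012-01.txt.bz2 /user/cloudera/archive/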

Upvotes: 0

David Gruzman

Reputation: 8088

You can use HDFS as an archiving/storage solution, though I doubt it is optimal. Specifically, it is not as highly available as, say, OpenStack Swift, and it is not suited for storing small files.
At the same time, if HDFS is your choice, I would suggest building the cluster with storage-oriented nodes. I would describe them as:
a) Use large and slow SATA disks. Since the data is not going to be read/written constantly, desktop-grade disks might do, which will be a major saving.
b) Use minimal memory; I would suggest 4 GB. It will not add much cost, but will still enable occasional MR processing.
c) A single CPU will do.

Regarding copyFromLocal: yes, the file does get split according to the defined block size.
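
You can check this yourself; for example, something like the following should report roughly 8 blocks for your 500 MB file at a 64 MB block size (path taken from your question):

$ hadoop fsck /user/cloudera/500MBFile.txt -files -blocks -locations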

Distribution will be even across the cluster, taking the replication factor into account. HDFS will also try to place each block's replicas on more than one rack.

Upvotes: 2
