Kobe-Wan Kenobi

Reputation: 3884

Store records on HDFS or in HBase

I have the following scenario:

My question is related to the second point - those files are later copied to HDFS, and I'm worried that having a large number of small files (e.g. 1 MB each) could be a problem.

My idea is to store those files in a database, so I would avoid the small-files problem and also be able to query the data (e.g. select a user's data for a given period). Is that a better approach?

If the answer is yes, which databases could I use? I need the database to be:

Upvotes: 2

Views: 622

Answers (1)

Simone Pessotto

Reputation: 1581

I think HBase is a perfect fit for your needs.

I also had the "small files problem", and I solved it using HBase.

Storing small files directly in HDFS is bad practice and can cause problems.

From the HBase project site:

Apache HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.

  • HBase is made for Hadoop
  • Rows can store different columns within a column family, and updated values are timestamped, so you can go back through the history of a cell
  • HBase and Hadoop are made for MapReduce jobs (rows can serve as input/output for a job)
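For example, that cell-version history is configured per column family when the table is created, and older versions can be read back. A sketch in the HBase shell (the `files` table and family names are just illustrative, not from the question):

```
create 'files', {NAME => 'info'}, {NAME => 'content', VERSIONS => 3}
put 'files', 'some-row-key', 'content:raw', 'first version'
put 'files', 'some-row-key', 'content:raw', 'second version'
get 'files', 'some-row-key', {COLUMN => 'content:raw', VERSIONS => 3}
```

The `get` with `VERSIONS => 3` returns up to three timestamped versions of the cell, newest first.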

In my case I had a lot of small files (200 KB to 1 MB each), and now I store them in a table with some columns for header/information, one column for the binary content of the file, and the file name as the row key (the file name is a UUID).
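A minimal sketch of that row layout in plain Python, with an in-memory dict standing in for the HBase table (a real client call, e.g. via happybase, would look similar but needs a running cluster; the column names here are illustrative, not from the answer):

```python
import uuid

# In-memory stand-in for the HBase table: row key -> {"family:qualifier": value}.
table = {}

def store_file(name, content, table):
    """Store one small file the way the answer describes:
    a UUID row key, an 'info' family for metadata, and a
    'content' family for the raw bytes."""
    row_key = str(uuid.uuid4())  # file name / row key is a UUID
    table[row_key] = {
        "info:name": name,              # original file name or header info
        "info:size": len(content),      # handy metadata column
        "content:raw": content,         # binary content of the file
    }
    return row_key

key = store_file("sensor-2016-01-01.log", b"\x00\x01\x02", table)
row = table[key]
```

Keeping metadata and content in separate column families lets a scan read only the small `info` columns without pulling the file bytes.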

Upvotes: 2
