Reputation: 3884
I have the following scenario:
My question is related to the second point: those files are later copied to HDFS, and I'm worried that the large number of small files (e.g. 1 MB each) could be a problem.
My idea is to store those files in a database instead, so I would avoid the small-files problem and also be able to query the data (select data for a user over a period). Is that a better approach?
If the answer is positive, which databases could I use? I need the database to be:
Upvotes: 2
Views: 622
Reputation: 1581
I think that HBase is a perfect fit for your needs.
I also had the "small files problem" and I solved it using HBase.
Storing small files directly in HDFS is a bad practice and can cause problems.
From the HBase project site:
Apache HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
In my case I had a lot of small files (200 KB to 1 MB), and now I store them in a table with some columns for header/metadata, one column for the binary content of the file, and the file name as the row key (the file name is a UUID).
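As a rough sketch of that layout, here is how the row key and column map could be built with the `happybase` Python client. The column-family names (`meta`, `raw`) and header fields are my own assumptions for illustration, not something from the original setup:

```python
import uuid

def make_row(file_name_uuid, header, content):
    """Build the HBase row key and column map for one small file.

    Mirrors the layout described above: the row key is the file's UUID
    name, a 'meta' column family holds the header fields, and
    'raw:content' holds the file's bytes. The family names 'meta' and
    'raw' are assumed here, not taken from the original answer.
    """
    row_key = file_name_uuid.encode("utf-8")
    cells = {b"meta:" + key.encode("utf-8"): str(value).encode("utf-8")
             for key, value in header.items()}
    cells[b"raw:content"] = content
    return row_key, cells

def store_file(table, header, content):
    """Write one file to an HBase table via happybase.

    `table` is a happybase.Table; this needs a running HBase Thrift
    server, e.g.:
        connection = happybase.Connection("hbase-host")  # hypothetical host
        table = connection.table("small_files")          # hypothetical table
    """
    file_name_uuid = str(uuid.uuid4())  # the file name is a UUID, as above
    row_key, cells = make_row(file_name_uuid, header, content)
    table.put(row_key, cells)  # single-row put, atomic per row
    return file_name_uuid
```

Because the row key is a UUID, writes spread evenly across regions, at the cost of losing meaningful scan ordering; if you need to select by user and period, a composite key such as `user_id + reversed_timestamp` is a common alternative.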
Upvotes: 2