Reputation: 3884
I have the following scenario:
My question is related to the second point: those files are later copied to HDFS, and I'm worried that the large number of small files (e.g. 1 MB each) could be a problem.
My idea is to store those files in a database instead, so I would avoid the small-files problem and also be able to query the data (select data for a user over a period). Is that a better approach?
If the answer is positive, which databases could I use? I need the database to be:
Upvotes: 2
Views: 622
Reputation: 1581
I think that HBase is a perfect fit for your needs.
I also had the "small files problem" and I solved it using HBase.
Storing small files directly in HDFS is a bad practice and can cause problems.
From the HBase project site:
Apache HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
In my case I had a lot of small files (200 KB to 1 MB), and now I store them in a table with some columns for header/metadata, one column for the binary content of the file, and the file name as the row key (the file name is a UUID).
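As a rough sketch of that layout, here is how the row key and column map could be built with the `happybase` Python client. The column-family names (`meta`, `raw`) and header fields are my own assumptions for illustration, not something from the original setup:

```python
import uuid

def make_row(file_name_uuid, header, content):
    """Build the HBase row key and column map for one small file.

    Mirrors the layout described above: the row key is the file's UUID
    name, a 'meta' column family holds the header fields, and
    'raw:content' holds the file's bytes. The family names 'meta' and
    'raw' are assumed here, not taken from the original answer.
    """
    row_key = file_name_uuid.encode("utf-8")
    cells = {b"meta:" + key.encode("utf-8"): str(value).encode("utf-8")
             for key, value in header.items()}
    cells[b"raw:content"] = content
    return row_key, cells

def store_file(table, header, content):
    """Write one file to an HBase table via happybase.

    `table` is a happybase.Table; this needs a running HBase Thrift
    server, e.g.:
        connection = happybase.Connection("hbase-host")  # hypothetical host
        table = connection.table("small_files")          # hypothetical table
    """
    file_name_uuid = str(uuid.uuid4())  # the file name is a UUID, as above
    row_key, cells = make_row(file_name_uuid, header, content)
    table.put(row_key, cells)  # single-row put, atomic per row
    return file_name_uuid
```

Because the row key is a UUID, writes spread evenly across regions, at the cost of losing meaningful scan ordering; if you need to select by user and period, a composite key such as `user_id + reversed_timestamp` is a common alternative.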
Upvotes: 2