Elektito

Reputation: 4155

Using database instead of thousands of small files

At work, I have started working on a program that can potentially generate hundreds of thousands of mostly small files an hour. My predecessors have found out that working with many small files can become very slow, so they have resorted to some (in my opinion) crude methods to alleviate the problem.

So I asked my boss why we don't use a database instead. He gave me his oh-so-famous I-know-better-than-you look and told me that obviously a database that big won't perform well.

My question is, is it really so? It seems to me that a database engine should be able to handle such data much better than the file system. Here are the conditions we have:

If you think we should go with the database solution, which open source database system do you think will work best? (If I decide that a database will certainly work better, I'm going to push for a change whatever the boss says!)

Upvotes: 1

Views: 1428

Answers (2)

Paul Tomblin

Reputation: 182762

As a minimal-impact improvement, I'd split your millions of small files into a hierarchy of directories. Say you were using UUIDs as your file names: I'd strip out the redundant urn:uuid: at the front, then make 16 directories based on the first character, and inside each of them 16 subdirectories based on the second character, adding more levels if you need them. That alone will speed up access quite a bit. Also, I would remove a directory whenever it becomes empty, to make sure the directory entry itself doesn't grow larger and larger.
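The directory scheme described above could be sketched roughly like this in Python. The helper names (`sharded_path`, `write_file`, `remove_file`) and the `levels` parameter are made up for illustration, and the sketch assumes every file name has at least `levels` characters after the prefix is stripped.

```python
import os

URN_PREFIX = "urn:uuid:"

def sharded_path(root, name, levels=2):
    """Map a file name to root/<c1>/<c2>/.../<name> (hypothetical helper)."""
    # Drop the redundant URN prefix so the leading characters vary.
    if name.startswith(URN_PREFIX):
        name = name[len(URN_PREFIX):]
    # One directory level per leading character of the name.
    parts = [name[i] for i in range(levels)]
    return os.path.join(root, *parts, name)

def write_file(root, name, data, levels=2):
    """Write data under the sharded path, creating directories as needed."""
    path = sharded_path(root, name, levels)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path

def remove_file(root, path):
    """Delete a file, then prune parent directories that became empty."""
    os.remove(path)
    parent = os.path.dirname(path)
    while os.path.abspath(parent) != os.path.abspath(root):
        try:
            os.rmdir(parent)  # only succeeds if the directory is empty
        except OSError:
            break  # not empty (or in use): stop pruning
        parent = os.path.dirname(parent)
```

With hexadecimal UUIDs and two levels this gives at most 16 x 16 = 256 leaf directories, so each one holds roughly 1/256 of the files.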

Upvotes: 2

Jeff Foster

Reputation: 44696

This is another one of those "it depends" type questions.

If you are just writing data (write once, read hardly ever) then just use the file system. Maybe use a hash-directory approach to create lots of sub-directories (things tend to get slow with many files in a single directory).

If you are writing hundreds of thousands of events for later querying (e.g. find everything with X > 10 and Y < 11) then a database sounds like a great idea.
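For the query case, a minimal sketch might look like the following. SQLite is used only because it ships with Python; it is not a recommendation of a particular engine. The `events` table, its `x`, `y` and `payload` columns, and the function names are assumptions for illustration.

```python
import sqlite3

def make_store(path=":memory:"):
    """Open (or create) a small event store; schema is illustrative only."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "id INTEGER PRIMARY KEY, x REAL, y REAL, payload BLOB)"
    )
    # An index on the queried columns lets the engine avoid a full scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_events_xy ON events (x, y)")
    return conn

def add_events(conn, rows):
    """Insert many (x, y, payload) rows in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO events (x, y, payload) VALUES (?, ?, ?)", rows
        )

def find_events(conn, x_min, y_max):
    """Return (id, x, y) for rows with x > x_min and y < y_max."""
    cur = conn.execute(
        "SELECT id, x, y FROM events WHERE x > ? AND y < ? ORDER BY id",
        (x_min, y_max),
    )
    return cur.fetchall()
```

Batching inserts into one transaction, as `add_events` does, usually matters a lot at hundreds of thousands of rows an hour, since committing per row forces a sync to disk each time.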

If you are writing hundreds of thousands of bits of non-relational data (e.g. simple key-value pairs) then it might be worth investigating a NoSQL approach.

The best approach is probably to prototype all the ideas you can think of, measure and compare!
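A prototype for that comparison does not need to be elaborate. The sketch below times writing `n` small files against inserting `n` small blobs into SQLite; the function names, the default sizes, and the choice of SQLite are all assumptions, and the numbers it produces only mean something on your own hardware and file system.

```python
import os
import sqlite3
import tempfile
import time

def time_files(directory, n, payload):
    """Write n small files into one directory and return elapsed seconds."""
    start = time.perf_counter()
    for i in range(n):
        with open(os.path.join(directory, f"{i}.bin"), "wb") as f:
            f.write(payload)
    return time.perf_counter() - start

def time_sqlite(db_path, n, payload):
    """Insert n small blobs in one transaction and return elapsed seconds."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE blobs (id INTEGER PRIMARY KEY, data BLOB)")
    start = time.perf_counter()
    with conn:
        conn.executemany(
            "INSERT INTO blobs (id, data) VALUES (?, ?)",
            ((i, payload) for i in range(n)),
        )
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

def compare(n=1000, size=512):
    """Run both write tests in a throwaway directory and return the timings."""
    payload = os.urandom(size)
    with tempfile.TemporaryDirectory() as tmp:
        files_dir = os.path.join(tmp, "files")
        os.mkdir(files_dir)
        return {
            "files": time_files(files_dir, n, payload),
            "sqlite": time_sqlite(os.path.join(tmp, "blobs.db"), n, payload),
        }

if __name__ == "__main__":
    print(compare())
```

This only measures writes. A fair comparison would also time the reads and deletes your program actually performs, at realistic file counts.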

Upvotes: 6
