cedivad

Reputation: 2654

What is the best way to store 1 Billion small text files?

On a single filesystem, I need to store 1 billion 1 KB text files. Every file has a unique id string, and access should be performance-optimized. Which layout is best?

EXT4: (example file structure for filename: kdWqpGQ1)

/kd/Wq/pG/Q1.file

or

/kdWqpGQ1.file

Or should I avoid this and use some kind of non-relational database?

Also, I can always split the 5 TB volume I have into 5 × 1 TB hard drives, holding about 200M files each. I want to add that 1B files is a limit case; I will most probably reach only 500M.

Thank you!

Upvotes: 5

Views: 4304

Answers (2)

yankee

Reputation: 40870

Your first option is much faster.

Think of a directory in a file system as a text file containing an unsorted list of all the files in that directory, each paired with the address where the file's data can be found on disk. To read a file you need to know that address. If you have a path like '/myfilename', you first open /, which is a directory listing all the files it contains. Then you scan that list for the entry 'myfilename', which in the worst case requires traversing the entire list. On average that takes O(N/2), where N here is 1 billion (the total number of files in this directory).

If you have multiple directories instead... Say always 1000 files per directory, so that you have 3 levels of directories and your file path is now /A/B/myfilename. Then you first open the / directory and find A (which takes O(1000/2)), open that directory and find B (O(1000/2) again), and open that one to find myfilename (yet again O(1000/2)). Adding those up gives 3 * O(1000/2) = 1500 entries scanned, which is MUCH faster than the O(500,000,000) we had previously.
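The back-of-envelope comparison above can be checked with a quick calculation (a sketch; the fanout of 1000 and the linear-scan cost model are the assumptions made in this answer, not properties of every file system):

```python
# Average number of directory entries scanned per lookup, assuming
# each directory is an unsorted list that must be scanned linearly.

total_files = 1_000_000_000

# Flat layout: one directory holding every file.
flat_scan = total_files / 2          # average case: scan half the entries

# Nested layout: 3 levels with ~1000 entries per directory.
fanout = 1000
levels = 3
nested_scan = levels * fanout / 2

print(flat_scan)    # 500000000.0
print(nested_scan)  # 1500.0
```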

This is a very important aspect of file systems to keep in mind. If a directory is in danger of exceeding 10,000 files, I'd strongly recommend thinking about a strategy for sorting those files into subdirectories.
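A minimal sketch of such a strategy, splitting the question's 8-character id into 2-character directory levels (the function name, the `/data` root, and the `.file` suffix follow the question's example and are otherwise arbitrary):

```python
import os

def nested_path(file_id: str, root: str = "/data", width: int = 2) -> str:
    """Map an id like 'kdWqpGQ1' to '/data/kd/Wq/pG/Q1.file'.

    All but the last chunk become directory levels; the last
    chunk becomes the filename itself.
    """
    chunks = [file_id[i:i + width] for i in range(0, len(file_id), width)]
    return os.path.join(root, *chunks[:-1], chunks[-1] + ".file")

print(nested_path("kdWqpGQ1"))  # /data/kd/Wq/pG/Q1.file
```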

Whether you would be better off with a relational database depends on other questions: Do you need backups (created concurrently)? Do you need transactions beyond what simple journaling file systems offer? Do you need concurrency control? Do you need to search through your files? How often do you access the files? How often do you change them?

For further reading on file systems I recommend the book Modern Operating Systems by Tanenbaum (chapter 6, "File systems"), which is available online here: http://lovingod.host.sk/index.html?page=tanenbaum%2FOperating-Systems-Design.html

Upvotes: 3

jforberg

Reputation: 6772

"Or should I avoid this and use some kind of non-relational database?"

Yes, certainly. Due to the way file systems work, it's a very bad idea to put your data into a billion separate files. Think of it as storing a fortune of 1 billion dollars in the form of quarters in one big container: there's no way to make that storage scheme "performance-optimised".

The NTFS file system, common on Windows, has a theoretical limit of about 4 billion files. Its default cluster size is 4 kB, so each 1 kB file occupies a full 4 kB on disk, meaning that your 1 TB of data would instantly grow to roughly 4 TB for this reason alone.
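The blow-up figure above follows from simple arithmetic (a sketch assuming the 4 kB default cluster size mentioned here; other cluster sizes can be chosen at format time):

```python
# Space blow-up from a 4 KB cluster size: each 1 KB file still
# occupies one full cluster on disk.
files = 1_000_000_000
cluster = 4 * 1024            # assumed default NTFS cluster, bytes
logical = files * 1024        # actual text stored: ~1 TB
on_disk = files * cluster     # space occupied: ~4 TB

print(logical / 1e12)   # 1.024  (TB of real data)
print(on_disk / 1e12)   # 4.096  (TB consumed on disk)
```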

You should probably be looking at an SQL database system such as SQLite. These have the advantage that you don't have to think about naming schemes and other practical details. You could also devise a custom format that stores all the data in just a few files. If you give details about the kind of data you're handling, maybe someone will have more specific advice for you!
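As a sketch of the SQLite route (the table and column names are illustrative, not prescribed by this answer; the id value is taken from the question's example):

```python
import sqlite3

# One table keyed by the unique id string; 1 KB text bodies fit
# comfortably in a TEXT column.
conn = sqlite3.connect(":memory:")  # use a file path in practice
conn.execute("CREATE TABLE files (id TEXT PRIMARY KEY, body TEXT)")

conn.execute("INSERT INTO files VALUES (?, ?)", ("kdWqpGQ1", "hello world"))
conn.commit()

row = conn.execute("SELECT body FROM files WHERE id = ?",
                   ("kdWqpGQ1",)).fetchone()
print(row[0])  # hello world
```

Lookups go through the primary-key index (a B-tree), so they stay fast regardless of how many rows the table holds.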

Upvotes: 5
