cedivad

Reputation: 2654

What is the best way to store 1 Billion small text files?

On a single filesystem, I need to store 1 billion 1 KB text files. Every file has a unique id string, and access should be performance-optimized. Which layout is best?

EXT4: (example file structure for filename: kdWqpGQ1)

/kd/Wq/pG/Q1.file

or

/kdWqpGQ1.file

Or should I avoid this and use some kind of non-relational database?

Also, I can always split the 5 TB volume I have into 5 × 1 TB hard drives, holding about 200M files each. I want to add that 1B files is a limit case; I will most probably reach only 500M.

Thank you!

Upvotes: 5

Views: 4304

Answers (2)

yankee

Reputation: 40870

Your first option is much faster.

Think of a directory in a file system as a text file containing an unsorted list of all the files in that directory, each paired with the address where the file's data can be found on disk. To read a file you need to know that address. If you have a path like '/myfilename', you first open /, which is a directory listing all the files it contains. Then you scan that list for the entry 'myfilename', which in the worst case requires traversing the entire list. On average that takes O(N/2), where N here is 1 billion (the total number of files in this directory).

If you have multiple directories instead... Say always 1000 files per directory, so that you have 3 levels of directories and your file path is now /A/B/myfilename. Then you first open the / directory and find A (which takes O(1000/2)), open that directory and find B (O(1000/2) again), and open that one to find myfilename (yet again O(1000/2)). Adding those up gives 3 * O(1000/2) = 1500 entries scanned, which is MUCH faster than the O(500,000,000) we had previously.
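The back-of-envelope comparison above can be checked with a quick calculation (a sketch; the fanout of 1000 and the linear-scan cost model are the assumptions made in this answer, not properties of every file system):

```python
# Average number of directory entries scanned per lookup, assuming
# each directory is an unsorted list that must be scanned linearly.

total_files = 1_000_000_000

# Flat layout: one directory holding every file.
flat_scan = total_files / 2          # average case: scan half the entries

# Nested layout: 3 levels with ~1000 entries per directory.
fanout = 1000
levels = 3
nested_scan = levels * fanout / 2

print(flat_scan)    # 500000000.0
print(nested_scan)  # 1500.0
```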

This is a very important aspect of file systems to keep in mind. If a directory is in danger of exceeding 10,000 files, I'd strongly recommend thinking about a strategy for sorting those files into subdirectories.
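A minimal sketch of such a strategy, splitting the question's 8-character id into 2-character directory levels (the function name, the `/data` root, and the `.file` suffix follow the question's example and are otherwise arbitrary):

```python
import os

def nested_path(file_id: str, root: str = "/data", width: int = 2) -> str:
    """Map an id like 'kdWqpGQ1' to '/data/kd/Wq/pG/Q1.file'.

    All but the last chunk become directory levels; the last
    chunk becomes the filename itself.
    """
    chunks = [file_id[i:i + width] for i in range(0, len(file_id), width)]
    return os.path.join(root, *chunks[:-1], chunks[-1] + ".file")

print(nested_path("kdWqpGQ1"))  # /data/kd/Wq/pG/Q1.file
```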

Whether you would be better off with a relational database depends on other questions: Do you need backups (created concurrently)? Do you need transactions beyond what simple journaling file systems offer? Do you need concurrency control? Do you need to search through your files? How often do you access the files? How often do you change them?

For further reading on file systems I recommend the book Modern Operating Systems by Tanenbaum (chapter 6, "File systems"), which is available online here: http://lovingod.host.sk/index.html?page=tanenbaum%2FOperating-Systems-Design.html

Upvotes: 3

jforberg

Reputation: 6772

"Or should I avoid this and use some kind of non-relational database?"

Yes, certainly. Due to the way file systems work, it's a very bad idea to put your data into a billion separate files. Think of it as storing a fortune of 1 billion dollars in the form of quarters in one big container: there's no way to make that storage scheme "performance-optimised".

The NTFS file system, common on Windows, has a theoretical limit of about 4 billion files. Its default cluster size is 4 kB, so each 1 kB file occupies a full 4 kB on disk, meaning that your 1 TB of data would instantly grow to roughly 4 TB for this reason alone.
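The blow-up figure above follows from simple arithmetic (a sketch assuming the 4 kB default cluster size mentioned here; other cluster sizes can be chosen at format time):

```python
# Space blow-up from a 4 KB cluster size: each 1 KB file still
# occupies one full cluster on disk.
files = 1_000_000_000
cluster = 4 * 1024            # assumed default NTFS cluster, bytes
logical = files * 1024        # actual text stored: ~1 TB
on_disk = files * cluster     # space occupied: ~4 TB

print(logical / 1e12)   # 1.024  (TB of real data)
print(on_disk / 1e12)   # 4.096  (TB consumed on disk)
```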

You should probably be looking at an SQL database system such as SQLite. These have the advantage that you don't have to think about naming schemes and other practical details. You could also devise a custom format that stores all the data in just a few files. If you give details about the kind of data you're handling, maybe someone will have more specific advice for you!
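As a sketch of the SQLite route (the table and column names are illustrative, not prescribed by this answer; the id value is taken from the question's example):

```python
import sqlite3

# One table keyed by the unique id string; 1 KB text bodies fit
# comfortably in a TEXT column.
conn = sqlite3.connect(":memory:")  # use a file path in practice
conn.execute("CREATE TABLE files (id TEXT PRIMARY KEY, body TEXT)")

conn.execute("INSERT INTO files VALUES (?, ?)", ("kdWqpGQ1", "hello world"))
conn.commit()

row = conn.execute("SELECT body FROM files WHERE id = ?",
                   ("kdWqpGQ1",)).fetchone()
print(row[0])  # hello world
```

Lookups go through the primary-key index (a B-tree), so they stay fast regardless of how many rows the table holds.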

Upvotes: 5
