Reputation: 2947
I have a deduplicated storage of some million files in a two-level hashed directory structure. The filesystem is an ext4 partition on a magnetic disk. The path of a file is computed by its MD5 hash like this:
e93ac67def11bbef905a7519efbe3aa7 -> e9/3a/e93ac67def11bbef905a7519efbe3aa7
When processing* a list of files sequentially (selected by metadata stored in a separate database), I can literally hear the noise produced by the seeks ("randomized" by the hashed directory layout as I assume).
My actual question is: Is there a (generic) way to process a potentially long list of potentially small files in a seek-optimized manner, given they are stored on an ext4 partition on a magnetic disk (implying the use of linux)?
Such optimization is of course only useful if there is a sufficient share of small files. So please don't care too much about the size distribution of files. Without loss of generality, you may actually assume that there are only small files in each list.
As a potential solution, I was thinking of sorting the files by their physical disk locations or by other (heuristic) criteria that can be related to the total amount and length of the seek operations needed to process the entire list.
A note on file types and use cases for illustration (if need be)
The files are a deduplicated backup of several desktop machines. So any file you would typically find on a personal computer will be included on the partition. The processing however will affect only a subset of interest that is selected via the database.
Here are some use cases for illustration (list is not exhaustive):
While researching for this question, I learned of the FIBMAP
ioctl
command (e.g. mentioned here) which may be worth a shot, because the files will not be moved around and the results may be stored along the metadata. But I suppose that will only work as sort criterion if the location of a file's inode correlates somewhat with the location of the contents. Is that true for ext4?
*) i.e. opening each file and reading the head of the file (arbitrary number of bytes) or the entire file into memory.
Upvotes: 0
Views: 171
Reputation: 1
A file (especially when it is large enough) is scattered on several blocks on the disk (look e.g. in the figure of ext2 wikipage, it still is somehow relevant for ext4, even if details are different). More importantly, it could be in the page cache (so won't require any disk access). So "sorting the file list by disk location" usually does not make any sense.
I recommend instead improving the code accessing these files. Look into system calls like posix_fadvise(2) and readahead(2).
If the files are really small (hundreds of bytes each only), it is probable that using something else (e.g. sqlite or some real RDBMS like PostGreSQL, or gdbm ...) could be faster.
BTW, adding more RAM could enlarge the page cache size, so the overall experience. And replacing your HDD by some SSD would also help.
(see also linuxatemyram)
Is it possible to sort a list of files to optimize read speed / minimize seek times?
That is not really possible. File system fragmentation is not (in practice) important with ext4. Of course, backing up all your file system (e.g. in some tar or cpio archive) and restoring it sequentially (after making a fresh file system with mkfs
) might slightly lower fragmentation, but not that much.
You might optimize your file system settings (block size, cluster size, etc... e.g. various arguments to mke2fs(8)). See also ext4(5).
Is there a (generic) way to process a potentially long list of potentially small files in a seek-optimized manner.
If the list is not too long (otherwise, split it in chunks of several hundred files each), you might open(2) each file there and use readahead(2) on each such file descriptor (and then close(2) it). This would somehow prefill your page cache (and the kernel could reorder the required IO operations).
(I don't know how effective is that in your case; you need to benchmark)
I am not sure there is a software solution to your issue. Your problem is likely IO-bound, so the bottleneck is probably the hardware.
Notice that on most current hard disks, the CHS addressing (used by the kernel) is some "logical" addressing handled by the disk controller and is not much related to physical geometry any more. Read about LBA, TCQ, NCQ (so today, the kernel has no direct influence on the actual mechanical movements of a hard disk head). I/O scheduling mostly happens in the hard disk itself (not much more in the kernel).
Upvotes: 1