TheChymera

Reputation: 17944

Manage many small (~5KB) files with git-annex

I have been using git-annex for a while now to manage my data, and I have found it quite satisfactory.

However, git-annex's performance is quite lacking when dealing with my neuroimaging data. This sort of data typically comes as many small image files (~5 KB each), around 36,000 per participant per experiment. You can see how, even for a few experiments, my data repository accumulates well over a million files.

Is there any way to mitigate the enormous lag when running git annex sync or git annex get? If not, is there any (roughly) similar software that might allow me to manage multiple repositories of neuroimaging data?

Upvotes: 2

Views: 379

Answers (2)

Bort

Reputation: 2491

I agree with db48x. If changing the neuroimaging software is not an option, you can use one container file per experiment (~180 MB is a reasonable size) and store that with git-annex. For data access you mount the container as a loopback filesystem. This should significantly reduce access times and the burden on git-annex.
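
For reference, a rough sketch of what that could look like; the file names, size, and mount point are placeholders:

    # create a ~200 MB container image and put an ext4 filesystem in it
    dd if=/dev/zero of=experiment01.img bs=1M count=200
    mkfs.ext4 -F experiment01.img

    # loopback-mount it, copy one experiment's files in, unmount, then annex the image
    mkdir -p /mnt/experiment01
    sudo mount -o loop experiment01.img /mnt/experiment01
    cp -r /path/to/experiment01/. /mnt/experiment01/
    sudo umount /mnt/experiment01
    git annex add experiment01.img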

Upvotes: 1

db48x

Reputation: 3176

Large numbers of files are inefficient on multiple levels; perhaps you could improve the neuroimaging software?

If that's not an option, you can do several things. The first is to store the data on an SSD. Operations like git annex sync and git annex get are slow because they must query the status of every file in your repository, and an SSD makes each of those disk reads much, much faster.

Another is to limit the number of files in any given directory. You may not be able to split up the files from a single experiment, but make sure you're not putting files from multiple experiments in the same directory, since the access time on a directory is often proportional to the number of files it contains.
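
As an illustration, assuming a flat directory with filenames that begin with an experiment identifier (a purely hypothetical naming scheme), the split could be scripted roughly like this:

    # hypothetical layout: files named like exp01_sub03_0001.img in a flat data/ directory
    cd data
    for f in exp*_*.img; do
        exp="${f%%_*}"        # experiment prefix, e.g. exp01
        mkdir -p "$exp"
        git mv "$f" "$exp/"   # keep the move tracked by git/git-annex
    done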

Another would be to investigate different filesystems, or different filesystem configurations; not all filesystems handle large directories well. For example, on ext3/4 you can set the dir_index filesystem option so that b-tree indexes are used to speed up access to large directories. Use the tune2fs program to set it.
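
For example (device name and mount point are placeholders; rebuilding the indexes of existing directories requires the filesystem to be unmounted):

    sudo umount /dev/sdXN
    sudo tune2fs -O dir_index /dev/sdXN   # enable hashed b-tree directory indexes
    sudo e2fsck -fD /dev/sdXN             # -D rebuilds/optimizes existing directory indexes
    sudo mount /dev/sdXN /mnt/data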

A last desperate option might be to combine all of these tiny files into archives, such as tarballs or zip files. This might complicate working with them, but would greatly reduce the number of files you have to deal with. You may also be able to script away some of the complexity this causes; for example, when you need to view one of these images your script could extract the tarball into a temporary directory, launch the viewer, and then delete the extracted files when it exits.
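
A minimal sketch of such a wrapper, assuming a gzipped tarball and with the viewer command left as a placeholder:

    #!/bin/sh
    # usage: view-image.sh archive.tar.gz path/inside/archive.img
    tmpdir=$(mktemp -d)
    trap 'rm -rf "$tmpdir"' EXIT      # remove the extracted files when the script exits
    tar -xzf "$1" -C "$tmpdir" "$2"   # extract only the requested file
    your-viewer "$tmpdir/$2"          # replace with the actual image viewer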

Upvotes: 3
