hippietrail

Reputation: 16964

Writing a large number of files from a long-running process?

I have a project which scans a large file (2.5 GB), picking out strings that will then be written to some subset of several hundred files.

It would be fastest just to use normal buffered writes, but:

  1. I'm worried about running out of filehandles.
  2. I want to be able to watch the progress of the files while they're being written.
  3. I would prefer as little loss as possible if the process is interrupted. Incomplete files are still partially useful.

So instead I open each file in read/write mode, append the new line, and close it again.
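For concreteness, that per-line pattern looks roughly like this in C (a minimal sketch; the real code is Perl, and append_line is a hypothetical helper):

    #include <stdio.h>

    /* Hypothetical helper: append one line to the named file, opening
       and closing the file around every single write.  Append mode
       ("a") creates the file if needed and positions at the end, which
       has the same effect as opening read/write and seeking to the end. */
    int append_line(const char *path, const char *line)
    {
        FILE *f = fopen(path, "a");
        if (f == NULL)
            return -1;
        int ok = fputs(line, f) >= 0 && fputc('\n', f) != EOF;
        if (fclose(f) != 0)   /* fclose flushes; an error here means lost data */
            ok = 0;
        return ok ? 0 : -1;
    }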

This was fast enough much of the time, but I have found that on certain OSes this behaviour is a severe pessimization. The last time I ran it on my Windows 7 netbook, I interrupted it after several days!

I can implement some kind of MRU filehandle manager which keeps a limited number of files open and flushes each one after a set number of write operations. But is this overkill?

This must be a common situation; is there a "best practice", a "pattern"?

The current implementation is in Perl and has run on Linux, Solaris, and Windows, on everything from netbooks to phat servers. But I'm interested in the general problem: language-independent and cross-platform. I've thought of writing the next version in C or node.js.

Upvotes: 3

Views: 185

Answers (1)

On Linux, you can open a lot of files (thousands). You can limit the number of open handles in a single process with the setrlimit syscall and the ulimit shell builtin. You can query the limits with the getrlimit syscall, and also via /proc/self/limits (or /proc/1234/limits for the process with pid 1234). The system-wide maximum number of open files is available through /proc/sys/fs/file-max (on my system it is 1623114).
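For example, a process can query its own limit with getrlimit and raise the soft limit up to the hard limit with setrlimit (a minimal sketch):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        printf("open files: soft=%lu hard=%lu\n",
               (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

        /* An unprivileged process may raise its soft limit,
           but only up to its hard limit. */
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        return 0;
    }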

So on Linux you could simply not bother, and keep many files open at once.

And I would suggest maintaining a memoized cache of opened files and reusing them when possible (with an MRU policy). Don't open and close each file too often; do so only when some limit has been reached (e.g. when an open fails).

In other words, you could have your own file abstraction (or just a struct) which knows the file name, may hold an opened FILE* (or a null pointer), and keeps the current offset, perhaps also the time of the last open or write; then manage a collection of such things with a FIFO discipline (for those having an opened FILE*). You certainly want to avoid close-ing (and later re-open-ing) a file descriptor too often.
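A minimal sketch of such an abstraction, with hypothetical names (the surrounding cache and its eviction logic are left out):

    #include <stdio.h>
    #include <time.h>

    struct outfile {
        const char *name;   /* file name */
        FILE *fp;           /* opened stream, or NULL while closed */
        long offset;        /* current offset, kept for reopening */
        time_t last_used;   /* time of last open or write */
    };

    /* Write one line, (re)opening the file on demand.  On open failure
       the caller should close some cached entry and retry. */
    int outfile_puts(struct outfile *of, const char *line)
    {
        if (of->fp == NULL) {
            of->fp = fopen(of->name, "a");  /* append keeps earlier content */
            if (of->fp == NULL)
                return -1;
        }
        if (fputs(line, of->fp) < 0 || fputc('\n', of->fp) == EOF)
            return -1;
        of->offset = ftell(of->fp);
        of->last_used = time(NULL);
        return 0;
    }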

You might occasionally (i.e. once every few minutes) call sync(2), but don't call it too often (certainly not more than once per 10 seconds). If you are using buffered FILE-s, don't forget to fflush them from time to time. Again, don't do that very often.
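One way to rate-limit both, sketched with illustrative interval constants:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    enum {
        FLUSH_INTERVAL = 10,   /* seconds between fflush passes */
        SYNC_INTERVAL  = 120   /* seconds between sync(2) calls */
    };

    /* Flush buffered streams at most once per interval, and call
       sync(2) even less often. */
    void maybe_flush(FILE **files, int count)
    {
        static time_t last_flush, last_sync;
        time_t now = time(NULL);

        if (now - last_flush >= FLUSH_INTERVAL) {
            for (int i = 0; i < count; i++)
                if (files[i] != NULL)
                    fflush(files[i]);   /* push stdio buffers to the kernel */
            last_flush = now;
        }
        if (now - last_sync >= SYNC_INTERVAL) {
            sync();                     /* schedule dirty pages for writeback */
            last_sync = now;
        }
    }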

Upvotes: 2
