Reputation: 5673
The background is developing a DBMS kernel, specifically database checkpoint processing. The rules of the game are such that we need to wait for outstanding asynchronous IOs on the file to finish before issuing fsync().
The solution we currently deploy is to count asynchronous IOs in flight manually, and wait for this count to go down to 0 before fsync()-ing or FlushFileBuffers()-ing. The question is whether we really have to do that; perhaps kernels/filesystems do it by themselves?
The OSes in question are mainly Windows and Linux, although I'm also curious how BSD-based OSes handle this, too.
On Linux, we're using libaio for asynchronous IO.
Upvotes: 2
Views: 987
Reputation: 9782
On Windows: Yes, for a given HANDLE instance, the current asynchronous i/o queue is drained before FlushFileBuffers() is executed. If you are writing a database, you really ought to use NtFlushBuffersFileEx() instead; it offers far finer granularity of synchronisation, which makes a huge difference.
On FreeBSD: Certainly with ZFS, yes. I can't say I've tested UFS, but I'd be surprised if it were not the same. FreeBSD implements cached async i/o as a kernel thread pool in any case; only uncached async i/o is truly async.
On Mac OS: No idea, and worse, disk i/o semantics have been all over the place over the last few releases. It was once very good, like BSD, but recently it's been going downhill. Async file i/o was always nearly unusable on Mac OS in any case; the maximum queue depth of 16 plus the requirement to use signals for async i/o completion is very hard to mix well with threaded code.
On Linux: For synchronous i/o, yes: fsync() enforces a total ordering, per inode, if your filesystem guarantees that (all the popular ones do). For libaio, which only really works right for O_DIRECT i/o in any case, I believe that the block storage layer does flush all enqueued i/o before telling the device to barrier, unless you have disabled barriers. For io_uring (which you ought to be using instead of libaio) with non-O_DIRECT i/o, the ordering is whatever the filesystem enforces for per-inode i/o once io_uring has processed the submission. For io_uring with O_DIRECT i/o, the block storage layer is a singleton, and should enforce ordering across the whole system, once io_uring has processed the submission.
I keep mentioning "once io_uring has processed the submission" because io_uring works with ring buffered queues. If you add an entry to the submission queue, it will get processed in order of submission by io_uring (i.e. the queue gets drained). From the moment of submission to the moment of io_uring consuming the submission, there is no ordering. But once io_uring has consumed the submission, the destination filesystem has been told of the i/o, and whatever ordering guarantees it implements it will apply to the ordering of completions it emits back to io_uring. So, when using io_uring, do not proceed after i/o submission until io_uring has drained your i/o submission request from the submission queue. This happens naturally using the syscall to tell io_uring to drain the queue, or for polling drains, you can watch the "last drained item" offset the kernel atomically updates as it consumes submission items.
Source: I am the author of the reference library for the WG21 C++ standardisation of low level i/o. Caveat: all of the above is purely from my memory and experience, and may be bitrotted or wrong.
Upvotes: 4