Reputation: 7198
Consider an application that is CPU bound, but also has high-performance I/O requirements.
I'm comparing Linux file I/O to Windows, and I can't see how epoll will help a Linux program at all. The kernel will tell me that the file descriptor is "ready for reading," but I still have to call blocking read() to get my data, and if I want to read megabytes, it's pretty clear that that will block.
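The limitation is arguably even stronger than described: per epoll_ctl(2), a regular file cannot be registered with epoll at all, and the call fails with EPERM. A minimal sketch to check this on a given system, assuming /tmp is writable; the helper name `epoll_add_regular_file` is made up for illustration, and setup failures simply abort via assert to keep it short:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: create a temporary regular file and try to register
 * it with epoll. Returns 0 if epoll accepted it, otherwise the errno value
 * reported by epoll_ctl(). */
int epoll_add_regular_file(void)
{
    char path[] = "/tmp/epoll_sketch_XXXXXX";   /* assumed writable location */
    int fd = mkstemp(path);
    assert(fd >= 0);
    unlink(path);                               /* file lives until close() */

    int ep = epoll_create1(0);
    assert(ep >= 0);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    int err = 0;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0)
        err = errno;                            /* EPERM for regular files */

    close(ep);
    close(fd);
    return err;
}
```

Calling `epoll_add_regular_file()` and comparing the result against `EPERM` shows whether the running kernel behaves as the man page describes.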
On Windows, I can create a file handle with OVERLAPPED set, and then use non-blocking I/O, and get notified when the I/O completes, and use the data from that completion function. I need to spend no application-level wall-clock time waiting for data, which means I can precisely tune my number of threads to my number of cores, and get 100% efficient CPU utilization.
If I have to emulate asynchronous I/O on Linux, then I have to allocate some number of threads to do this, and those threads will spend a little bit of time doing CPU things, and a lot of time blocking for I/O, plus there will be overhead in the messaging to/from those threads. Thus, I will either over-subscribe or under-utilize my CPU cores.
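For concreteness, the thread-based emulation described above might look like the following sketch. It is not a recommended design, only an illustration of where the blocking moves to; `struct read_job`, `read_worker`, and `submit_read` are hypothetical names, and a real program would use a pool rather than one thread per request:

```c
#include <assert.h>
#include <fcntl.h>      /* open() for callers */
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical job record: one blocking pread() run on a helper thread. */
struct read_job {
    int fd;
    void *buf;
    size_t len;
    off_t offset;
    ssize_t result;                       /* filled in by the worker */
    void (*on_done)(struct read_job *);   /* optional; runs on the worker thread */
};

static void *read_worker(void *arg)
{
    struct read_job *job = arg;
    /* This is where the wall-clock time is spent blocking. */
    job->result = pread(job->fd, job->buf, job->len, job->offset);
    if (job->on_done)
        job->on_done(job);
    return NULL;
}

/* Start the read and return at once. Returns 0 on success, or the error
 * number from pthread_create(). */
int submit_read(struct read_job *job, pthread_t *thread)
{
    assert(job != NULL && thread != NULL);
    return pthread_create(thread, NULL, read_worker, job);
}
```

The caller fills in a `struct read_job`, calls `submit_read()`, and later either joins the thread or reacts to `on_done`. The hand-off and the extra thread are exactly the overhead the paragraph above is worried about.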
I looked at mmap() + madvise() (WILLNEED) as a "poor man's async I/O" but it still doesn't get all the way there, because I can't get a notification when it's done -- I have to "guess" and if I guess "wrong" I will end up blocking on memory access, waiting for data to come from disk.
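A rough sketch of that approach, with mincore() added as one way to poll (not be notified) whether the pages have arrived. The helper name `resident_pages_after_willneed` is hypothetical, and the count is only a snapshot: it can be stale immediately, and some kernels report every page as resident to callers who lack write access to the file.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map `path` read-only, hint MADV_WILLNEED, then ask
 * mincore() how many pages are resident right now. Returns the resident page
 * count, or -1 on error. *total_pages receives the mapping size in pages. */
long resident_pages_after_willneed(const char *path, long *total_pages)
{
    assert(path != NULL && total_pages != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;

    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                               /* the mapping keeps the file open */
    if (map == MAP_FAILED)
        return -1;

    madvise(map, len, MADV_WILLNEED);        /* a hint only; no completion event */

    unsigned char *vec = malloc(npages);
    long resident = -1;
    if (vec != NULL && mincore(map, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;          /* low bit set: page is in core */
    }

    free(vec);
    munmap(map, len);
    *total_pages = (long)npages;
    return resident;
}
```

A caller would have to invoke this repeatedly, or touch the memory and accept a possible page-fault stall, which is the "guessing" problem described above.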
Linux seems to have the starts of async I/O in io_submit, and it seems to also have a user-space POSIX aio implementation, but it's been that way for a while, and I know of nobody who would vouch for these systems for critical, high-performance applications.
The Windows model works roughly like this:
Steps 1/2 are typically done as a single thing. Steps 3/4 are typically done with a pool of worker threads, not (necessarily) the same thread as issues the I/O. This model is somewhat similar to the model provided by boost::asio, except boost::asio doesn't actually give you asynchronous block-based (disk) I/O.
The difference from epoll in Linux is that in step 4, no I/O has yet happened -- it hoists step 1 to come after step 4, which is "backwards" if you know exactly what you need already.
Having programmed a large number of embedded, desktop, and server operating systems, I can say that this model of asynchronous I/O is very natural for certain kinds of programs. It is also very high-throughput and low-overhead. I think this is one of the remaining real shortcomings of the Linux I/O model, at the API level.
Upvotes: 62
Views: 29679
Reputation: 7144
(2020) If you're using a 5.1 or above Linux kernel you can use the io_uring interface for file-like I/O and obtain excellent asynchronous operation.
Compared to the existing libaio/KAIO interface, io_uring has the following advantages:
[...] liburing helper library)
[...] recvmsg()/sendmsg() are supported from >=5.3, see messages mentioning the word support in io_uring.c's git history)
[...] read/write (e.g. fsync (>=5.1), fallocate (>=5.6), splice (>=5.7) and more)
Compared to glibc's POSIX AIO, io_uring has the following advantages:
[...] io_uring most certainly can!
The "Efficient IO with io_uring" document goes into far more detail as to io_uring's benefits and usage. The "What's new with io_uring" document describes new features added to io_uring between the 5.2 - 5.5 kernels, while "The rapid growth of io_uring" LWN article describes which features were available in each of the 5.1 - 5.5 kernels with a forward glance to what was going to be in 5.6 (also see LWN's list of io_uring articles). There's also a "Faster IO through io_uring" Kernel Recipes videoed presentation (slides) from late 2019 and "What’s new with io_uring" Kernel Recipes videoed presentation (slides) from mid 2022 by io_uring author Jens Axboe. Finally, the "Lord of the io_uring" tutorial gives an introduction to io_uring usage.
The io_uring community can be reached via the io_uring mailing list and the io_uring mailing list archives show daily traffic at the start of 2021.
Re "support partial I/O in the sense of recv() vs read()": a patch went into the 5.3 kernel that will automatically retry io_uring short reads and a further commit went into the 5.4 kernel that tweaks the behaviour to only automatically take care of short reads when working with "regular" files on requests that haven't set the REQ_F_NOWAIT flag (it looks like you can request REQ_F_NOWAIT via IOCB_NOWAIT or by opening the file with O_NONBLOCK). Thus you can get recv()-style "short" I/O behaviour from io_uring too.
[...] io_uring
Though the interface is young (its first incarnation arrived in May 2019), some open-source software is using io_uring "in the wild":
[...] io_uring ioengine to the libaio ioengine on an Optane device.
[...] io_uring backend for MultiRead in Dec 2019 and was part of its 6.7.3 release. Jens states io_uring helped to dramatically cut latency.
[...] io_uring backend in Dec 2019. While some of the author's original points were addressed in newer kernels, at the time of writing (mid 2021) libev's author has some choice words about io_uring's maturity and is taking a wait-and-see approach before implementing further improvements.
[...] io_uring backend outperforming the threads and aio backends on one workload of random 16K blocks.
[...] io_uring VFS backend in Feb 2020 and was part of the Samba 4.12 release. In the "Linux io_uring VFS backend." Samba mailing list thread, Stefan Metzmacher (the commit author) says the io_uring module was able to push roughly 19% more throughput (compared to some unspecified backend) in a synthetic test. You can also read the "Async VFS Future" PDF presentation by Stefan for some of the motivation behind the changes.
[...] io_uring more accessible to pure rust. rio is one library talked about a bit and the author says they achieved higher throughput compared to using sync calls wrapped in threads. The author gave a presentation about his database and library at FOSDEM 2020 which included a section extolling the virtues of io_uring.
[...] io_uring. The author (Glauber Costa) published a document called "Modern storage is plenty fast. It is the APIs that are bad" showing that with careful tuning Glommio could get over 2.5 times the performance over regular (non-io_uring) syscalls when performing sequential I/O on an Optane device.
[...] io_uring
[...] io_uring improvements (e.g. the workaround to reduce filesystem inode contention). There is a presentation "Asynchronous IO for PostgreSQL" (be aware the video is broken until the 5 minute mark) (PDF) motivating the need for PostgreSQL changes and demonstrating some experimental results. He has expressed hope of getting his optional io_uring support into PostgreSQL 14 and seems acutely aware of what does and doesn't work even down to the kernel level. In December 2020, Andres further discusses his PostgreSQL io_uring work in the "Blocking I/O, async I/O and io_uring" pgsql-hackers mailing list thread and mentions the work in progress can be seen over in https://github.com/anarazel/postgres/tree/aio .
[...] io_uring support which needs a 5.9 kernel
[...] io_uring support but its progress into the project has been slow
[...] io_uring support for eventing (but not syscalls) in April 2020 and the "Linux: full io_uring I/O" issue outlines plans to integrate it further
[...] io_uring
[...] io_uring syscalls can be used. This distro doesn't pre-package the liburing helper library but you can build it for yourself.
[...] io_uring syscalls can be used. As above, the distro doesn't pre-package liburing.
[...] liburing so io_uring is usable.
[...] io_uring syscalls can be used. This distro doesn't pre-package the liburing helper library but you can build it for yourself.
[...] io_uring (a previous version of this answer mistakenly said it did). There is an "Add io_uring support" Red Hat knowledge base article (contents is behind a subscriber paywall) that is "in progress".
[...] io_uring. The kernel is new enough (5.14) but support for io_uring is explicitly disabled.
Hopefully io_uring will usher in a better asynchronous file-like I/O story for Linux.
(To add a thin veneer of credibility to this answer, at some point in the past Jens Axboe (Linux kernel block layer maintainer and inventor of io_uring) thought this answer might be worth upvoting :-)
Upvotes: 88
Reputation: 6713
As explained in:
http://code.google.com/p/kernel/wiki/AIOUserGuide
and here:
http://www.ibm.com/developerworks/library/l-async/
Linux does provide async block I/O at the kernel level, with the following APIs:
aio_read - Request an asynchronous read operation
aio_error - Check the status of an asynchronous request
aio_return - Get the return status of a completed asynchronous request
aio_write - Request an asynchronous write operation
aio_suspend - Suspend the calling process until one or more asynchronous requests have completed (or failed)
aio_cancel - Cancel an asynchronous I/O request
lio_listio - Initiate a list of I/O operations
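As an illustration of how several of the calls listed above fit together, here is a minimal sketch of a single asynchronous read. The helper name `posix_aio_read_file` is hypothetical, and on older glibc versions the program may need to be linked with -lrt:

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to `len` bytes of `path` at `offset` through
 * POSIX AIO. Returns the number of bytes read, or -1 with errno set. */
ssize_t posix_aio_read_file(const char *path, void *buf, size_t len, off_t offset)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = offset;

    if (aio_read(&cb) < 0) {                 /* queue the request */
        close(fd);
        return -1;
    }

    /* A real program would do CPU work here. This sketch just waits. */
    const struct aiocb *const list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);          /* re-checked if interrupted */

    int err = aio_error(&cb);
    ssize_t n = aio_return(&cb);             /* call once per completed request */
    close(fd);

    if (err != 0) {
        errno = err;
        return -1;
    }
    return n;
}
```

aio_cancel() and lio_listio() are left out to keep the sketch short.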
If you ask who uses these APIs, the answer is the kernel itself. A small subset is shown here:
./drivers/net/tun.c (for network tunnelling):
static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
./drivers/usb/gadget/inode.c:
ep_aio_read(struct kiocb *iocb, const struct iovec *iov,
./net/socket.c (general socket programming):
static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
./mm/filemap.c (mmap of files):
generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
./mm/shmem.c:
static ssize_t shmem_file_aio_read(struct kiocb *iocb,
etc.
At the userspace level, there is also the io_submit() family of APIs (from glibc), but the following article offers an alternative to using glibc:
http://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt
It implements functions like io_setup() as direct syscalls (bypassing any glibc dependency), so a kernel entry point matching the "__NR_io_setup" syscall number should exist. Searching the kernel source at:
http://lxr.free-electrons.com/source/include/linux/syscalls.h#L474 (the URL applies to the latest version, 3.13) turns up the declarations of these io_*() APIs in the kernel:
474 asmlinkage long sys_io_setup(unsigned nr_reqs, aio_context_t __user *ctx);
475 asmlinkage long sys_io_destroy(aio_context_t ctx);
476 asmlinkage long sys_io_getevents(aio_context_t ctx_id,
481 asmlinkage long sys_io_submit(aio_context_t, long,
483 asmlinkage long sys_io_cancel(aio_context_t ctx_id, struct iocb __user *iocb,
A later version of glibc should make this use of "syscall()" to call sys_io_setup() unnecessary, but without the latest glibc you can always make these calls yourself, provided you are running a kernel recent enough to offer "sys_io_setup()".
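A minimal sketch of making these calls yourself through syscall(), under the assumption that the kernel headers provide <linux/aio_abi.h>. The `raw_io_*` wrappers and `kernel_aio_read` are hypothetical names, not part of any library. Note that for buffered (non-O_DIRECT) files the submit call itself may block, so this only shows the calling convention:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical thin wrappers around the raw system calls. */
static long raw_io_setup(unsigned nr, aio_context_t *ctx)
{ return syscall(SYS_io_setup, nr, ctx); }
static long raw_io_submit(aio_context_t ctx, long n, struct iocb **iocbs)
{ return syscall(SYS_io_submit, ctx, n, iocbs); }
static long raw_io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *events, struct timespec *timeout)
{ return syscall(SYS_io_getevents, ctx, min_nr, nr, events, timeout); }
static long raw_io_destroy(aio_context_t ctx)
{ return syscall(SYS_io_destroy, ctx); }

/* Hypothetical helper: read up to `len` bytes of `path` at `offset` using the
 * kernel AIO syscalls. Returns bytes read, a negative errno value reported in
 * the completion event, or -1 if setup or submission failed. */
long kernel_aio_read(const char *path, void *buf, size_t len, long long offset)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    aio_context_t ctx = 0;                   /* must be zero before io_setup() */
    if (raw_io_setup(1, &ctx) < 0) {
        close(fd);
        return -1;
    }

    struct iocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_fildes = (uint32_t)fd;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = offset;

    struct iocb *list[1] = { &cb };
    long result = -1;
    if (raw_io_submit(ctx, 1, list) == 1) {
        struct io_event ev;
        /* Wait for exactly one completion, with no timeout. */
        if (raw_io_getevents(ctx, 1, 1, &ev, NULL) == 1)
            result = (long)ev.res;
    }

    raw_io_destroy(ctx);
    close(fd);
    return result;
}
```

A real program would keep the context alive, submit many requests, and reap completions from worker threads rather than setting up and tearing down per read.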
Of course, there are other userspace options for asynchronous I/O (e.g., using signals?):
http://personal.denison.edu/~bressoud/cs375-s13/supplements/linux_altIO.pdf
or perhaps:
What is the status of POSIX asynchronous I/O (AIO)?
"io_submit" and friends are still not available in glibc (see the io_submit man pages), which I have verified on my Ubuntu 14.04, but this API is Linux-specific.
Others, like libuv, libev, and libevent, are also asynchronous APIs:
http://nikhilm.github.io/uvbook/filesystem.html#reading-writing-files
http://software.schmorp.de/pkg/libev.html
All these APIs aim to be portable across BSD, Linux, Mac OS X, and even Windows.
In terms of performance I have not seen any numbers, but I suspect libuv may be the fastest, due to its lightweight design?
https://ghc.haskell.org/trac/ghc/ticket/8400
Upvotes: 3
Reputation: 7198
The real answer, which was indirectly pointed to by Peter Teoh, is based on io_setup() and io_submit(). Specifically, the "aio_" functions indicated by Peter are part of the glibc user-level emulation based on threads, which is not an efficient implementation. The real answer is in:
io_submit(2)
io_setup(2)
io_cancel(2)
io_destroy(2)
io_getevents(2)
Note that the man page, dated 2012-08, says that this implementation has not yet matured to the point where it can replace the glibc user-space emulation:
http://man7.org/linux/man-pages/man7/aio.7.html
this implementation hasn't yet matured to the point where the POSIX AIO implementation can be completely reimplemented using the kernel system calls.
So, according to the latest kernel documentation I can find, Linux does not yet have a mature, kernel-based asynchronous I/O model. And, if I assume that the documented model is actually mature, it still doesn't support partial I/O in the sense of recv() vs read().
Upvotes: 20