user2511788

Reputation: 159

aio_write takes more time than plain write on ext4?

I have a C program that writes 32768 blocks, each 16 KiB in size (512 MiB in total), to an ext4 filesystem on a system running a 3.18.1 kernel. The regular write() system call version of this program takes 5.35 seconds to finish the writes (as measured by gettimeofday() before and after the for loop). The async I/O version of the program, however, takes the following times:

  1. queueing all 32768 aio_write() requests: 7.43 seconds
  2. polling until every I/O request finishes: an additional 4.93 seconds

The output files are opened with these flags: O_WRONLY, O_CREAT, O_NONBLOCK.

Why does async I/O take more than double the write() time? Even the ratio of time-to-queue-the-async-I/O-requests to time-to-write-synchronously is 1.4.

Since some people marked the question off-topic, I looked at the definition and decided to paste the code; that seems to be the only reason it might be considered off-topic. I am not asking why the code does not work, only why AIO is much slower than regular writes, especially since all the parallel writes go to different blocks. Here is the AIO code, followed by the non-AIO code:

AIO program

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define MAX_AIO        (16384*2)
#define BUFSIZE        16384

struct mys {
    int status;
    struct aiocb aio;
};

/* set_buf() fills a buffer with a per-block pattern; its definition is not
 * shown in the question, so this prototype is assumed from the call site. */
void set_buf(void *buf, int size, int val);

void set_aiocb(struct mys *aio, int num, int fd)
{
    int i;

    for (i = 0; i < num; i++) {
        memset(&aio[i].aio, 0, sizeof(struct aiocb));  /* start from a clean control block */
        aio[i].aio.aio_fildes = fd;
        aio[i].aio.aio_offset = (off_t)BUFSIZE * i;    /* each request targets its own block */
        aio[i].aio.aio_buf = malloc(BUFSIZE);
        set_buf((void *)aio[i].aio.aio_buf, BUFSIZE, i);
        aio[i].aio.aio_nbytes = BUFSIZE;
        aio[i].aio.aio_reqprio = 0;                    /* was fd; aio_reqprio only lowers priority */
        aio[i].aio.aio_sigevent.sigev_notify = SIGEV_NONE;   /* no completion signal; we poll */
        aio[i].aio.aio_sigevent.sigev_signo = SIGUSR1;        /* ignored with SIGEV_NONE */
        aio[i].aio.aio_sigevent.sigev_value.sival_ptr = &aio[i];
        aio[i].aio.aio_lio_opcode = 0;                 /* only consulted by lio_listio() */
        aio[i].status = EINPROGRESS;
    }
}

int main(void)
{
    int fd = open("/tmp/AIO", O_WRONLY | O_CREAT, 0666);
    int i, open_reqs = MAX_AIO;
    static struct mys aio[MAX_AIO];   /* 32768 control blocks; static keeps them off the stack */
    struct timeval start, end, diff;

    set_aiocb(aio, MAX_AIO, fd);

    gettimeofday(&start, NULL);
    /* Queue all 32768 writes without waiting for any of them. */
    for (i = 0; i < MAX_AIO; i++)
        aio_write(&aio[i].aio);

    /* Busy-poll every request until none is still EINPROGRESS. */
    while (open_reqs > 0) {
        for (i = 0; i < MAX_AIO; i++) {
            if (aio[i].status == EINPROGRESS) {
                aio[i].status = aio_error(&(aio[i].aio));
                if (aio[i].status != EINPROGRESS)
                    open_reqs--;
            }
        }
    }
    gettimeofday(&end, NULL);
    timersub(&end, &start, &diff);
    printf("%ld.%06ld\n", (long)diff.tv_sec, (long)diff.tv_usec);  /* pad usec so 5.035 is not shown as 5.35 */
    return 0;
}
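
As an aside, the completion loop above burns CPU by spinning on aio_error(). Below is a minimal sketch of the same wait using aio_suspend(), which blocks until a queued request finishes; it reuses the struct mys layout from the program above, and error handling is omitted.

/* Sketch: wait for all requests with aio_suspend() instead of busy-polling.
 * Assumes the same struct mys aio[] array as in the program above. */
void wait_all(struct mys *aio, int num)
{
    const struct aiocb *pending[1];
    int i, open_reqs = num;

    while (open_reqs > 0) {
        for (i = 0; i < num; i++) {
            if (aio[i].status != EINPROGRESS)
                continue;                         /* already reaped */
            if (aio_error(&aio[i].aio) == EINPROGRESS) {
                pending[0] = &aio[i].aio;
                aio_suspend(pending, 1, NULL);    /* sleep until this request completes */
            }
            aio[i].status = aio_error(&aio[i].aio);
            if (aio[i].status != EINPROGRESS) {
                (void)aio_return(&aio[i].aio);    /* collect the result exactly once */
                open_reqs--;
            }
        }
    }
}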

Regular IO program

#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

#define MAX_AIO        (16384*2)
#define BUFSIZE        16384

char buf[MAX_AIO][BUFSIZE];   /* 512 MiB, zero-initialized in the BSS */
int main(void)
{
    int i, fd = open("/tmp/NON_AIO", O_WRONLY | O_CREAT, 0666);
    struct timeval start, end, diff;

    gettimeofday(&start, NULL);
    for (i = 0; i < MAX_AIO; i++)
        write(fd, buf[i], BUFSIZE);
    gettimeofday(&end, NULL);
    timersub(&end, &start, &diff);
    printf("%ld.%06ld\n", (long)diff.tv_sec, (long)diff.tv_usec);
    return 0;
}

Upvotes: 3

Views: 344

Answers (1)

Jonathan Leffler

Reputation: 754550

You aren't really comparing apples with apples.

In the AIO code, you have a separately allocated buffer for each of the write operations, so the program has 512 MiB of memory allocated (32768 buffers of 16 KiB each), plus the 32768 copies of the AIO control structure. That memory has to be allocated and initialized (each buffer gets a different value, assuming the set_buf() function, which is not shown, sets each byte of the buffer to the value of its third parameter), and then copied by the kernel to the driver, possibly via the kernel buffer pool.

In the regular IO code, you have one big contiguous buffer, initialized to all zeroes, that you write to the disk.

To make the comparison equitable, you should use the same infrastructure in both programs: create the AIO structures in both, but have the regular IO code simply step through the structures, writing the data portion of each (while the AIO code behaves more or less as shown). I expect you will find that the performance is much more similar when you do that.
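
For illustration, here is a minimal sketch of what that equalized regular-write program might look like. It assumes the struct mys, set_aiocb() and set_buf() definitions (and the same includes) from the AIO program in the question, and it omits error handling.

/* Sketch: regular writes reusing the AIO program's per-request buffers,
 * so both versions pay the same allocation and initialization cost. */
int main(void)
{
    int i, fd = open("/tmp/NON_AIO", O_WRONLY | O_CREAT, 0666);
    static struct mys aio[MAX_AIO];
    struct timeval start, end, diff;

    set_aiocb(aio, MAX_AIO, fd);      /* same 32768 malloc'd, pattern-filled buffers */

    gettimeofday(&start, NULL);
    for (i = 0; i < MAX_AIO; i++)     /* step through the structures, writing each data block */
        write(fd, (const void *)aio[i].aio.aio_buf, aio[i].aio.aio_nbytes);
    gettimeofday(&end, NULL);

    timersub(&end, &start, &diff);
    printf("%ld.%06ld\n", (long)diff.tv_sec, (long)diff.tv_usec);
    return 0;
}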

Upvotes: 1
