Reputation: 651
I was trying to use dd to test the performance of my Ceph filesystem. During testing I found something confusing: dd with oflag=dsync or conv=fdatasync/fsync is around 10 times faster than dd with oflag=direct.
My network is 2*10Gb.
/mnt/testceph# dd if=/dev/zero of=/mnt/testceph/test1 bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 23.1742 s, 46.3 MB/s
/mnt/testceph# dd if=/dev/zero of=/mnt/testceph/test1 bs=1G count=1 conv=fdatasync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.22468 s, 483 MB/s
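For reference, the oflag=dsync case I mention above was run in the same shape as the commands here (I'm only showing the invocation, not its timing output):
/mnt/testceph# dd if=/dev/zero of=/mnt/testceph/test1 bs=1G count=1 oflag=dsync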
Upvotes: 3
Views: 10979
Reputation: 7124
dd with oflag=dsync or conv=fdatasync/fsync is around 10 times faster than dd with oflag=direct
conv=fdatasync / conv=fsync still mean I/O is initially queued to the kernel cache and destaged to disk as the kernel sees fit. This gives the kernel a big opportunity to merge I/Os, create parallel submission out of I/O that has yet to be destaged, and generally decouples I/O submission to the kernel from I/O acceptance by the disk (to the extent that buffering will allow). Only when dd has finished sending ALL the data will it have to wait for anything still only in cache to be flushed to disk (and with fsync that includes any metadata).
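One way to see this ordering for yourself (a sketch, assuming strace is available and reusing the file path and sizes from the question) is to trace the syscalls dd makes - the buffered write(s) return quickly and a single fdatasync at the end does the waiting:
# Trace only the calls of interest; path and sizes are just the ones from the question
strace -e trace=write,fdatasync,fsync dd if=/dev/zero of=/mnt/testceph/test1 bs=1G count=1 conv=fdatasync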
oflag=dsync is still allowed to make use of kernel buffering - it just causes a flush + wait for completion after each submission. Since you are sending only one giant write, this puts you into near enough the same scenario as conv=fdatasync above.
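If you want to see the per-write flush of oflag=dsync actually matter, a rough sketch is to send the same amount of data as many smaller writes (the 4M size here is an arbitrary example, not a recommendation) and compare against conv=fdatasync:
# Same 1 GiB total, but 256 separate writes, each followed by a flush + wait
dd if=/dev/zero of=/mnt/testceph/test1 bs=4M count=256 oflag=dsync
# Same 1 GiB as 256 buffered writes with a single flush at the end
dd if=/dev/zero of=/mnt/testceph/test1 bs=4M count=256 conv=fdatasync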
When you specify oflag=direct you are saying "trust that all my parameters are sensible and turn off as much kernel buffering as you can". In your case a bs that huge is nonsensical with O_DIRECT, as your "disk"'s maximum transfer block size (let alone the optimal size) is almost certainly smaller. You'll likely trigger splitting, but due to the memory requirements on O_DIRECT the splitting points may lead to smaller I/Os than in the cases above.
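A sketch of how you might sanity check that (the rbd0 name is purely an assumption - substitute whatever block device actually backs your mount, if there is one; a pure CephFS mount won't have these files at all):
# How big a single I/O the backing block device will accept
cat /sys/block/rbd0/queue/max_sectors_kb /sys/block/rbd0/queue/max_hw_sectors_kb
# Retry O_DIRECT with a block size at or below that limit and compare the throughput
dd if=/dev/zero of=/mnt/testceph/test1 bs=4M count=256 oflag=direct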
It's hard to tell for sure what's going on though. Really we would need to see how the I/O was leaving the bottom of the kernel (e.g. by comparing iostat output during the runs) to get a better idea.
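For example (a sketch - iostat is part of the sysstat package and the exact column names vary between versions):
# Extended per-device stats, in MB, once a second while dd is running;
# watch avgrq-sz/areq-sz to see how big the I/Os really are when they leave the kernel
iostat -dxm 1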
TL;DR: maybe using oflag=direct is leading to smaller sized I/Os leaving the kernel and thus causing worse performance in your scenario?
Upvotes: 11