Michael Davis

Reputation: 138

Google Cloud Platform: zonal persistent SSDs higher performance than local SSDs?

I've been running a database application that writes data synchronously to disk, so I'm looking for the best disk throughput. I've found that GCP's local SSDs are supposed to provide the best performance (both for IOPS and MB/s). However, when I benchmark synchronous database writes, the throughput achieved by a zonal persistent SSD is significantly better than that of the local SSD. Strangely, using a single local SSD gives better performance than a RAID 0 configuration with four of them.

To test the performance I ran a benchmark consisting of a single thread creating transactions in a loop, each performing a random 4 KB write.
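
For reference, the workload is roughly equivalent to the following fio job (an illustrative sketch, not the exact script I used; the directory, size and runtime are placeholders): single-threaded random 4 KB writes with an fsync after every write.

# Approximate reproduction of the benchmark: 1 thread, random 4 KB writes,
# fsync after every write (directory/size/runtime are placeholders)
fio --name=sync-4k-writes \
    --directory=/mnt/disks/stable-store \
    --rw=randwrite --bs=4k --size=1G \
    --numjobs=1 --iodepth=1 --fsync=1 \
    --runtime=60 --time_based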

The zonal persistent SSD was 128 GB, while the local SSD setup consisted of four SSDs in RAID 0. An N2D machine with 32 vCPUs was used to eliminate any CPU bottleneck. To make sure it wasn't a problem with the OS or filesystem, I tried several different versions, including the ones recommended by Google, but the result was always the same.

On average, the results of my experiments were:

SSD                                  Latency    Throughput
Zonal persistent SSD (128 GB)        ~1.5 ms    ~700 writes/second
Local SSD (4 NVMe SSDs in RAID 0)    ~14 ms     ~71 writes/second
Local SSD (single SSD)               ~13 ms     ~75 writes/second

I'm at a bit of a loss on how to proceed, as I'm not sure if this result should be expected. If so, it seems like my best option is to use zonal persistent disks. Do you think that these results seem correct, or might there be some problem with my setup?

Suggestions along the lines of relaxing write-caching/flushing behaviour will improve performance; however, the goal here is fast performance for synchronous disk writes. Otherwise, my best option would be zonal persistent SSDs (they offer replicated storage), or just using RAM, which will always be faster than any SSD.

As AdolfoOG suggested, there might be an issue with my RAID configuration, so to shed some light on this, here are the commands I used to create my RAID 0 setup with four devices. Note that /dev/nvme0nX refers to each NVMe device I'm using.

# Build a RAID 0 array from the four local NVMe SSDs
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4
# Format the array with ext4, then mount it and make it writable
sudo mkfs.ext4 -F /dev/md0
sudo mkdir /mnt/disks/
sudo mkdir /mnt/disks/stable-store
sudo mount /dev/md0 /mnt/disks/stable-store
sudo chmod a+w /mnt/disks/stable-store

This should be the same process as what Google advises, unless I messed something up, of course!
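
For completeness, here is how I check what was actually built (standard mdadm/procfs commands, nothing GCP-specific), in case the chunk size or member list reveals a mistake:

cat /proc/mdstat                 # array state and member devices
sudo mdadm --detail /dev/md0     # RAID level, chunk size, device list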

Upvotes: 0

Views: 2097

Answers (2)

Chris Madden

Reputation: 2660

Local SSDs are optimized for temporary storage, and writes are ack'd once they hit the SSD write cache. Then, per the documentation, within 2 seconds those writes are committed to stable media. Given the reliability guarantees of Local SSD (very low: it can fail and the data is lost), this seems like a reasonable tradeoff.

With Local SSD, if your application does a write, then an fsync, then a write, then an fsync, the high latency of those fsync calls adds up and likely explains the high latency you observed. The solution would be to skip those fsyncs, either in your DB or when you mount the filesystem; see the documentation mentioned above for more. Frankly, whenever you use Local SSD you should be prepared to lose that data, either by being able to recreate it (job-processing use case) or because you have redundancy at a higher layer (app/DB use case).
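
As an illustrative sketch only (check the docs for the exact options recommended for your kernel version), relaxing flush behaviour for an ext4-formatted Local SSD array could look like this; the usual caveat applies that anything still in the drive cache is lost on a crash:

# Mount without write barriers so fsync doesn't force a full cache flush.
# Older kernels accept "nobarrier"; newer ext4 versions spell it "barrier=0".
sudo mount -o discard,defaults,nobarrier /dev/md0 /mnt/disks/stable-store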

With a zonal PD, writes are stable once they are ack'd, and any fsync is basically a no-op that returns quite quickly. The write latency from your tests seems on the high side. If you are hitting IOPS or throughput limits you will see them plateau and latency increase. For the lowest latency, I'd try creating the disk and the VM in one request (this increases the likelihood that the resources will be nearby within the zone) and see if that helps. If latency is stable, and you need more IOPS and aren't at the disk's IOPS limits, then most likely you need to increase concurrency to get more work "in flight", resulting in higher IOPS.
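
As a rough sketch (instance name, zone, machine type and disk size are just placeholders), creating the VM and the PD-SSD in a single request looks something like this:

# Create the VM and its pd-ssd data disk in one request
gcloud compute instances create db-vm-1 \
    --zone=us-central1-a \
    --machine-type=n2d-standard-32 \
    --create-disk=name=db-data,size=500GB,type=pd-ssd,auto-delete=yes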

Upvotes: 1

AdolfoOG

Reputation: 186

Answer completely edited after the original question was edited:

I tried to replicate your situation using a more "stock" approach: I didn't code anything to test the MB/s, I just used dd and hdparm. I also used an n2-standard-32 instance with a 100 GB persistent SSD as the boot disk and a RAID 0 array of four NVMe local SSDs. Below are my results:

Write tests:

Persistent SSD:

root@instance-1:~# dd if=/dev/zero of=./test oflag=direct bs=1M count=16k
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 18.2175 s, 943 MB/s

root@instance-1:~# dd if=/dev/zero of=./test oflag=direct bs=1M count=32k
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 42.1738 s, 815 MB/s

root@instance-1:~# dd if=/dev/zero of=./test oflag=direct bs=1M count=64k
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB, 64 GiB) copied, 83.6243 s, 822 MB/s

Local SSD:

root@instance-1:~# dd if=/dev/zero of=/mnt/disks/raid/test oflag=direct bs=1M count=16k
16384+0 records in
16384+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 10.6567 s, 1.6 GB/s

root@instance-1:~# dd if=/dev/zero of=/mnt/disks/raid/test oflag=direct bs=1M count=32k
32768+0 records in
32768+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 21.26 s, 1.6 GB/s

root@instance-1:~# dd if=/dev/zero of=/mnt/disks/raid/test oflag=direct bs=1M count=64k
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB, 64 GiB) copied, 42.4611 s, 1.6 GB/s

Read tests:

Persistent SSD:

root@instance-1:~# hdparm -tv /dev/sda

/dev/sda:
 multcount     =  0 (off)
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 13054/255/63, sectors = 209715200, start = 0
 Timing buffered disk reads: 740 MB in  3.00 seconds = 246.60 MB/sec

Local SSD:

root@instance-1:~# hdparm -tv /dev/md0

/dev/md0:
 readonly      =  0 (off)
 readahead     = 8192 (on)
 geometry      = 393083904/2/4, sectors = 3144671232, start = unknown
 Timing buffered disk reads: 6888 MB in  3.00 seconds = 2761.63 MB/sec

So I'm actually seeing better performance from the local SSD RAID, and both the read and write numbers match what's expected according to this performance table:

Throughput (MB/s):  Read: 2,650;  Write: 1,400

So maybe there is something odd with the way you tested the performance, since you mentioned that you wrote a little script to do it; maybe if you try a more "stock" approach you'll get the same results as I did.
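
If you want to stay with a "stock" tool but get closer to your synchronous 4 KB write pattern, dd can also do that (the path and count are just examples); oflag=dsync forces each 4 KB write to be flushed before the next one is issued:

dd if=/dev/zero of=/mnt/disks/raid/synctest oflag=dsync bs=4k count=10000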

Upvotes: 2
