Yassine Fadhlaoui

Reputation: 344

How to make ZFS highly available

I am working on a project where we use ZFS as a storage volume manager. On top of ZFS, an iSCSI tgt daemon is running and exposes the ZFS devices as SCSI disks. The problem now is ZFS high availability: ZFS cannot be clustered, and the solutions I have looked at have some issues, which is why I avoided them.

Is there any way to make these SCSI disks highly available by making the ZFS pool highly available? Would adding a clustered filesystem on top of ZFS make any sense?

Upvotes: 3

Views: 3926

Answers (2)

Dan

Reputation: 7737

Andrew Henle’s comment describes the most obvious way to do this: force-import the pool with zpool import -f on the secondary server and prevent the primary from re-importing the storage. The second part is the hard part, though!
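
For reference, a minimal sketch of that manual failover step on the secondary, assuming the primary has already been fenced (powered off or detached from the storage); the pool name "tank" is made up for illustration:

    # Manual failover sketch: run on the secondary only after the primary
    # has been fenced and can no longer write to the shared storage.
    import subprocess

    POOL = "tank"  # hypothetical pool name

    def force_import(pool):
        # -f overrides the "pool was last accessed by another system" check,
        # so this must only run once the primary is known to be dead/fenced.
        subprocess.run(["zpool", "import", "-f", pool], check=True)

    def export_pool(pool):
        # Cleanly hand the pool back (e.g. when the primary recovers).
        subprocess.run(["zpool", "export", pool], check=True)

    if __name__ == "__main__":
        force_import(POOL)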

If you can physically detach the storage immediately after the server dies, perfect. If not, which will be the case for most systems, you will need some way to manage this transfer of pool ownership between servers, probably with some kind of keepalive / ownership-lease protocol. You can do this either in the storage itself or at some higher level.

  • Doing it in the storage means you can prevent the primary from reattaching the pool (or from continuing to write to the pool if it never really died! eek!) by first checking that you hold ownership before doing a write. Leases make sense for this because they give you guaranteed ownership for some fixed amount of time before you have to renew the lease, say N seconds, so you don’t have to check ownership before every IO. When the secondary wants to take over, you write a new lease on disk to take ownership at some future time T (through T+N seconds), then wait N seconds for any previous lease written to the disk to expire (which ensures the old system will see your new lease and stop issuing writes), and finally import the filesystem fully (see the sketch after this list). In ZFS it might make sense to tie leases to a given txg instead of using timestamp-based leases, since timestamps mean your servers need very similar clocks or your mutual exclusion may not work (although the ZIL creates issues for this because it can be updated outside of a txg, IIRC). Ideally this would be a feature of ZFS itself, but I don’t think anyone has implemented it yet (although I know it’s been discussed).
  • Doing it at a higher layer is advantageous too, though, because you can use the highest-layer symptoms possible to trigger the failover. For example, maybe your primary can talk to the storage but not the network, or maybe it became unresponsive because of a performance issue or a background task that is still making progress, just slowly. To cover these cases you want keepalives reported by the clients that are trying to reach the storage over the network, rather than by the storage servers themselves.
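
To make the lease idea from the first bullet concrete, here is a minimal sketch of a timestamp-based lease stored in a small reserved on-disk region. The read_block/write_block helpers, the reserved region, and the 10-second lease length are all assumptions for illustration; as noted above, nothing like this exists inside ZFS today.

    # Illustrative lease-based ownership protocol (not a ZFS feature).
    # A lease record says "owner O may write until time END". A server that
    # wants to take over writes a lease starting in the future, waits out any
    # lease the old owner might still hold, and only then imports the pool.
    import json, time

    LEASE_SECONDS = 10  # "N": how long ownership lasts before it must be renewed

    def read_lease(dev):
        # Assumption: dev.read_block(0) reads a small reserved region of the disk.
        raw = dev.read_block(0)
        return json.loads(raw) if raw else None

    def write_lease(dev, owner, start, end):
        dev.write_block(0, json.dumps({"owner": owner, "start": start, "end": end}))

    def take_ownership(dev, me):
        # Claim ownership from some future time T onward (T through T+N)...
        t = time.time() + LEASE_SECONDS
        write_lease(dev, me, t, t + LEASE_SECONDS)
        # ...then wait N seconds so any lease the previous owner wrote expires
        # and it has had a chance to see our claim and stop issuing writes.
        time.sleep(LEASE_SECONDS)
        # Only now is it safe to run "zpool import -f" and start serving IO.

    def may_write(dev, me):
        # Checked once per renewal period (not on every IO) by the current owner.
        lease = read_lease(dev)
        return lease is not None and lease["owner"] == me and lease["start"] <= time.time() < lease["end"]

As the first bullet points out, timestamp leases like this assume the two servers’ clocks are reasonably in sync; a txg-tied scheme would avoid that dependency.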

Ultimately the best solution is to use high-level symptoms to decide whether to fail over, but low-level enforcement of mutual exclusion. Without support for mutual exclusion inside ZFS, however, you may need to do both above the ZFS layer, for example by adding a shim layer that checks for ownership before issuing a write to ZFS (a sketch of such a shim follows).
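
A shim of that kind could be as simple as the following sketch, reusing the hypothetical may_write() check from the lease sketch above to gate every write before it reaches the ZFS-backed device; all names here are illustrative only.

    # Illustrative ownership shim sitting above ZFS (names are made up).
    class OwnershipShim:
        def __init__(self, backend, lease_dev, me):
            self.backend = backend      # object that actually writes to the zvol/dataset
            self.lease_dev = lease_dev  # device holding the lease record (see sketch above)
            self.me = me                # this server's identity

        def write(self, offset, data):
            # Refuse the write if we no longer hold the lease, to avoid split-brain.
            if not may_write(self.lease_dev, self.me):
                raise IOError("lost pool ownership; refusing write")
            return self.backend.write(offset, data)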

If you think network partitions and performance problems are not really going to be an issue compared to machine crashes / reboots (probably a reasonable assumption in small-ish datacenters since these are lower-probability events), then you probably don’t need the storage-level mutual exclusion at all, and the higher-layer solution would work fine.

Upvotes: 2

Mikhail Zakharov

Reputation: 1089

See https://mezzantrop.wordpress.com/portfolio/the-beast/ and check whether it is applicable to your case.

Upvotes: 0
