jsharpe

Reputation: 2671

Nomad fails to release CSI volume during "restart -reschedule", which would move allocations to a new host

Context:

Problem:

From what I can tell, nomad doesn't even try to release the volume. There's no "failed to release" message in any log file (nomad server, nomad client, ebs controller, ebs node).

The first error I see anywhere is this:

[ERROR] nomad.fsm: CSIVolumeClaim failed: error="volume max claims reached"

which occurs on the new node as it attempts to mount the volume.

At this point the previous allocation is dead/stopped, but the volume is still mounted on the previous host, and the volume is marked as unavailable.
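One way to confirm this state from the CLI (the volume and instance IDs below are placeholders, and the AWS CLI check assumes access to the account):

nomad volume status my-ebs-volume        # still shows a claim on the old node; volume unavailable
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0        # still reports an attachment to the old instance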

Upvotes: 2

Views: 211

Answers (1)

aZaD

Reputation: 29

  • First, upgrade both Nomad and the AWS EBS CSI plugin to their latest versions. Newer releases often include fixes for compatibility issues and bugs like this one.
  • Then, recover the stuck volume manually (a sketch of these steps follows this list). Identify the stuck volume: find the allocation ID and volume ID of the problematic volume, and make sure the previous allocation really is stopped before proceeding. Detach the volume manually: use the AWS CLI or the EBS API to detach the volume from the old host. Clear the "max claims reached" error: delete the volume claim left over from the old allocation; this frees Nomad to claim and mount the volume on the new host.
  • Lastly, consider using Nomad's -force flag with restart -reschedule to force the release of volumes. Use this cautiously, as it can lead to data loss if not handled carefully.
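A minimal sketch of the manual recovery, assuming the AWS CLI is configured and a Nomad version recent enough to support nomad volume detach; the volume, node, and instance IDs are placeholders:

nomad volume status my-ebs-volume
# confirm which node still holds the claim and that the old allocation is stopped

nomad volume detach my-ebs-volume <old-node-id>
# asks Nomad to run the CSI unpublish steps and release the claim on the old node

aws ec2 detach-volume --volume-id vol-0123456789abcdef0 --force
# if the CSI plugin cannot complete the detach, force it at the EBS level

nomad volume deregister -force my-ebs-volume
nomad volume register volume.hcl
# last resort: drop and re-register the volume definition to clear stale claims

After the claim is cleared, the rescheduled allocation should be able to mount the volume on the new host.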

Upvotes: 0
