Josh

Reputation: 69

Longhorn RWX PVC Stuck Terminating

My team is experiencing an issue with Longhorn where our RWX PVCs sometimes get stuck in Terminating indefinitely after running kubectl delete. A symptom of this is that the finalizers never get removed.
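For illustration, this is roughly how we confirm a claim is stuck (the PVC name and namespace are placeholders):

kubectl get pvc my-pvc -n my-namespace -o jsonpath='{.metadata.finalizers}'
# keeps returning the remaining finalizer(s), e.g. ["kubernetes.io/pvc-protection"], long after the delete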

It was explained to me that the longhorn-csi-plugin containers should execute ControllerUnpublishVolume once no workload is using the volume, and then execute DeleteVolume, which removes the finalizer. Inspecting the logs when this issue occurs, the ControllerUnpublishVolume call looks unsuccessful and DeleteVolume is never called. It looks like the response to ControllerUnpublishVolume is {}, which does not seem right to me. The following logs are abridged and only include lines relevant to the volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1:

2023-04-04T19:28:52.993226550Z time="2023-04-04T19:28:52Z" level=info msg="CreateVolume: creating a volume by API client, name: pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1, size: 21474836480 accessMode: rwx"
...
2023-04-04T19:29:01.119651932Z time="2023-04-04T19:29:01Z" level=debug msg="Polling volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 state for volume created at 2023-04-04 19:29:01.119514295 +0000 UTC m=+2789775.717296902"
2023-04-04T19:29:01.123721718Z time="2023-04-04T19:29:01Z" level=info msg="CreateVolume: rsp: {\"volume\":{\"capacity_bytes\":21474836480,\"volume_context\":{\"fromBackup\":\"\",\"fsType\":\"ext4\",\"numberOfReplicas\":\"3\",\"recurringJobSelector\":\"[{\\\"name\\\":\\\"backup-1-c9964a87-77074ba4\\\",\\\"isGroup\\\":false}]\",\"share\":\"true\",\"staleReplicaTimeout\":\"30\"},\"volume_id\":\"pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1\"}}"
...
2023-04-04T19:29:01.355417228Z time="2023-04-04T19:29:01Z" level=info msg="ControllerPublishVolume: req: {\"node_id\":\"node1.example.com\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":5}},\"volume_context\":{\"fromBackup\":\"\",\"fsType\":\"ext4\",\"numberOfReplicas\":\"3\",\"recurringJobSelector\":\"[{\\\"name\\\":\\\"backup-1-c9964a87-77074ba4\\\",\\\"isGroup\\\":false}]\",\"share\":\"true\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1677846786942-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1\"}"
...
2023-04-04T19:29:01.362958346Z time="2023-04-04T19:29:01Z" level=debug msg="ControllerPublishVolume: volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 is ready to be attached, and the requested node is node1.example.com"
2023-04-04T19:29:01.363013363Z time="2023-04-04T19:29:01Z" level=info msg="ControllerPublishVolume: volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 with accessMode rwx requesting publishing to node1.example.com"
...
2023-04-04T19:29:13.477036437Z time="2023-04-04T19:29:13Z" level=debug msg="Polling volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 state for volume published at 2023-04-04 19:29:13.476922567 +0000 UTC m=+2789788.074705223"
2023-04-04T19:29:13.479320941Z time="2023-04-04T19:29:13Z" level=info msg="ControllerPublishVolume: volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 with accessMode rwx published to node1.example.com"
...
2023-04-04T19:31:59.230234638Z time="2023-04-04T19:31:59Z" level=info msg="ControllerUnpublishVolume: req: {\"node_id\":\"node1.example.com\",\"volume_id\":\"pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1\"}"
2023-04-04T19:31:59.233597451Z time="2023-04-04T19:31:59Z" level=debug msg="requesting Volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 detachment for node1.example.com"
...
2023-04-04T19:32:01.242531135Z time="2023-04-04T19:32:01Z" level=debug msg="Polling volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 state for volume unpublished at 2023-04-04 19:32:01.242373423 +0000 UTC m=+2789955.840156051"
2023-04-04T19:32:01.245744768Z time="2023-04-04T19:32:01Z" level=debug msg="Volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 unpublished from node1.example.com"
...
2023-04-04T19:32:01.268399507Z time="2023-04-04T19:32:01Z" level=info msg="ControllerUnpublishVolume: req: {\"node_id\":\"node1.example.com\",\"volume_id\":\"pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1\"}"
2023-04-04T19:32:01.270584270Z time="2023-04-04T19:32:01Z" level=debug msg="requesting Volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 detachment for node1.example.com"
...
2023-04-04T19:32:02.512117513Z time="2023-04-04T19:32:02Z" level=info msg="ControllerPublishVolume: req: {\"node_id\":\"node2.example.com\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":5}},\"volume_context\":{\"fromBackup\":\"\",\"fsType\":\"ext4\",\"numberOfReplicas\":\"3\",\"recurringJobSelector\":\"[{\\\"name\\\":\\\"backup-1-c9964a87-77074ba4\\\",\\\"isGroup\\\":false}]\",\"share\":\"true\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1677846786942-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1\"}"
...
2023-04-04T19:32:02.528810094Z time="2023-04-04T19:32:02Z" level=debug msg="ControllerPublishVolume: volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 is ready to be attached, and the requested node is node2.example.com"
2023-04-04T19:32:02.528829340Z time="2023-04-04T19:32:02Z" level=info msg="ControllerPublishVolume: volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 with accessMode rwx requesting publishing to node2.example.com"
...
2023-04-04T19:32:03.273890290Z time="2023-04-04T19:32:03Z" level=debug msg="Polling volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 state for volume unpublished at 2023-04-04 19:32:03.272811565 +0000 UTC m=+2789957.870594214"
2023-04-04T19:32:03.289152604Z time="2023-04-04T19:32:03Z" level=debug msg="Volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 unpublished from node1.example.com"
...
2023-04-04T19:32:03.760644399Z time="2023-04-04T19:32:03Z" level=info msg="ControllerPublishVolume: req: {\"node_id\":\"node1.example.com\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":5}},\"volume_context\":{\"fromBackup\":\"\",\"fsType\":\"ext4\",\"numberOfReplicas\":\"3\",\"recurringJobSelector\":\"[{\\\"name\\\":\\\"backup-1-c9964a87-77074ba4\\\",\\\"isGroup\\\":false}]\",\"share\":\"true\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1677846786942-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1\"}"
2023-04-04T19:32:03.770050254Z time="2023-04-04T19:32:03Z" level=debug msg="ControllerPublishVolume: volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 is ready to be attached, and the requested node is node1.example.com"
2023-04-04T19:32:03.770093689Z time="2023-04-04T19:32:03Z" level=info msg="ControllerPublishVolume: volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 with accessMode rwx requesting publishing to node1.example.com"
...
2023-04-04T19:32:04.654700819Z time="2023-04-04T19:32:04Z" level=debug msg="Polling volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 state for volume published at 2023-04-04 19:32:04.654500435 +0000 UTC m=+2789959.252283106"
2023-04-04T19:32:04.657991819Z time="2023-04-04T19:32:04Z" level=info msg="ControllerPublishVolume: volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 with accessMode rwx published to node2.example.com"
2023-04-04T19:32:04.658583043Z time="2023-04-04T19:32:04Z" level=info msg="ControllerPublishVolume: rsp: {}"
...
2023-04-04T19:32:05.822264526Z time="2023-04-04T19:32:05Z" level=debug msg="Polling volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 state for volume published at 2023-04-04 19:32:05.82208573 +0000 UTC m=+2789960.419868382"
2023-04-04T19:32:05.826506892Z time="2023-04-04T19:32:05Z" level=info msg="ControllerPublishVolume: volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 with accessMode rwx published to node1.example.com"
2023-04-04T19:32:05.827051042Z time="2023-04-04T19:32:05Z" level=info msg="ControllerPublishVolume: rsp: {}"
...
2023-04-04T20:07:03.798730851Z time="2023-04-04T20:07:03Z" level=info msg="ControllerUnpublishVolume: req: {\"node_id\":\"node1.example.com\",\"volume_id\":\"pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1\"}"
2023-04-04T20:07:03.802360032Z time="2023-04-04T20:07:03Z" level=debug msg="requesting Volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 detachment for node1.example.com"
2023-04-04T20:07:05.808796454Z time="2023-04-04T20:07:05Z" level=debug msg="Polling volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 state for volume unpublished at 2023-04-04 20:07:05.808607472 +0000 UTC m=+2792060.406390073"
2023-04-04T20:07:05.811653301Z time="2023-04-04T20:07:05Z" level=debug msg="Volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 unpublished from node1.example.com"
...
2023-04-04T20:07:11.017524059Z time="2023-04-04T20:07:11Z" level=info msg="ControllerUnpublishVolume: req: {\"node_id\":\"node2.example.com\",\"volume_id\":\"pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1\"}"
2023-04-04T20:07:11.024127188Z time="2023-04-04T20:07:11Z" level=debug msg="requesting Volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 detachment for node2.example.com"
...
2023-04-04T20:07:13.047834933Z time="2023-04-04T20:07:13Z" level=debug msg="Volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 unpublished from node2.example.com"
2023-04-04T20:07:13.047839690Z time="2023-04-04T20:07:13Z" level=info msg="ControllerUnpublishVolume: rsp: {}"
2023-04-04T20:07:13.378731066Z time="2023-04-04T20:07:13Z" level=info msg="ControllerUnpublishVolume: req: {\"node_id\":\"node2.example.com\",\"volume_id\":\"pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1\"}"
2023-04-04T20:07:13.384575838Z time="2023-04-04T20:07:13Z" level=debug msg="requesting Volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 detachment for node2.example.com"
...
2023-04-04T20:07:13.385792532Z time="2023-04-04T20:07:13Z" level=info msg="ControllerUnpublishVolume: rsp: {}"
2023-04-04T20:07:15.386784410Z time="2023-04-04T20:07:15Z" level=debug msg="Polling volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 state for volume unpublished at 2023-04-04 20:07:15.386596264 +0000 UTC m=+2792069.984378910"
2023-04-04T20:07:15.391059508Z time="2023-04-04T20:07:15Z" level=debug msg="Volume pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1 unpublished from node2.example.com"
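(For reference, the lines above were pulled from the CSI plugin pods with something along these lines; the namespace, label, and container name assume a default Longhorn install:)

kubectl -n longhorn-system logs -l app=longhorn-csi-plugin -c longhorn-csi-plugin --timestamps --tail=-1 \
  | grep pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1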

We are using Longhorn v1.2.2 on Rancher RKE v2.6.5.

We would expect that DeleteVolume would be called, the finalizers would be removed, and the PVC would be deleted, but none of those events occur.
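For what it is worth, while the PVC hangs we also look at the PV and at VolumeAttachment objects (commands are illustrative):

kubectl get pv | grep pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1
# does the PV still exist, and in what phase?
kubectl get volumeattachments | grep pvc-52d816a8-bbb2-4ac8-9e79-0b9950eafdc1
# is a VolumeAttachment still holding the volume to a node?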

As a workaround, we tried forcefully removing the finalizers with kubectl patch pvc my-pvc -p '{"metadata":{"finalizers":null}}' --type=merge. This works, but it is not something we want to do every time.

Any ideas about what is wrong? If not, what should our next steps be in investigating this issue?

Upvotes: 1

Views: 1714

Answers (1)

glv

Reputation: 1167

Most likely some Pod is still mounting the volume you are trying to remove.

Check out this old answer of mine: https://stackoverflow.com/a/75768413/21404450
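If you want to confirm that, here is a quick check for workloads still referencing the claim (names are placeholders, and the second variant assumes jq is installed):

kubectl describe pvc my-pvc -n my-namespace | grep -A1 'Used By'
# or list every pod in the namespace that mounts the claim:
kubectl get pods -n my-namespace -o json \
  | jq -r '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName=="my-pvc") | .metadata.name'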

Upvotes: 1
