Jinxed NZ

Reputation: 35

How to resolve unresponsive/failing bosh-lite cloud foundry vms?

I am trying to learn Cloud Foundry using bosh-lite on a MacBook Pro. I can get it running, but each time I set it up from scratch it eventually stops working. I suspect this is associated with stopping the VirtualBox VM or putting the laptop to sleep, but I can't confirm that this is the cause.

My experience is limited, and I'm having difficulty not just resolving the issue but also understanding what is going wrong. Apologies if this is an obvious problem, but I haven't been able to work out how to stop it from happening. The only solution I've found so far is to destroy the deployment using Vagrant and start from scratch, which takes a while and surely isn't the optimal fix. :)

I've noticed that 'bosh vms' shows unresponsive agents and that the VMs are not starting properly. The error from 'bosh cck' indicates a locking issue, but I suspect this is misleading, as running 'bosh locks' reports that there are no locks. Once again, I'm a newbie, so this may simply be a misunderstanding on my part.

How do I fix this? Is there a way to quickly 'reset' to a working state? (vagrant reload --provision doesn't help.) Where exactly is the issue?

Also, what is the (default) root password for the vagrant cloudfoundry/bosh-lite VM?

> bosh vms

+---------------------------------------------------------------------------+--------------------+-----+-----------+--------------+
| VM                                                                        | State              | AZ  | VM Type   | IPs          |
+---------------------------------------------------------------------------+--------------------+-----+-----------+--------------+
| api_z1/0 (8dfeb143-59b1-46dd-9482-e90931a70a0d)                           | unresponsive agent | n/a | large_z1  | 10.244.0.138 |
| blobstore_z1/0 (7795ce02-d64e-4cc7-be1e-0e328384d568)                     | unresponsive agent | n/a | medium_z1 | 10.244.0.130 |
| consul_z1/0 (e92f6bfd-f623-4ba4-abf3-3d4baa0953fa)                        | unresponsive agent | n/a | small_z1  | 10.244.0.54  |
| doppler_z1/0 (049eaa18-3d4f-48d8-92ed-ea4b6a20cd29)                       | unresponsive agent | n/a | medium_z1 | 10.244.0.146 |
| etcd_z1/0 (e45a7648-e43d-4753-8a18-3ab21b86293d)                          | unresponsive agent | n/a | large_z1  | 10.244.0.42  |
| ha_proxy_z1/0 (ba6e8ce6-8f40-4868-8a71-c74119f173ea)                      | failing            | n/a | router_z1 | 10.244.0.34  |
| hm9000_z1/0 (ff8ae6a3-1889-4fb0-aabf-072012cf9f48)                        | unresponsive agent | n/a | medium_z1 | 10.244.0.142 |
| loggregator_trafficcontroller_z1/0 (8f2e4ea1-dda7-4d15-9050-528338824e3b) | unresponsive agent | n/a | small_z1  | 10.244.0.150 |
| nats_z1/0 (9e4eab32-ac91-4f05-83be-b8189c2991e7)                          | unresponsive agent | n/a | medium_z1 | 10.244.0.6   |
| postgres_z1/0 (fb8d1eee-3ade-480e-aa01-3db26a64b447)                      | unresponsive agent | n/a | medium_z1 | 10.244.0.30  |
| router_z1/0 (f9ce017b-580f-4fce-b79d-01ceef190e19)                        | unresponsive agent | n/a | router_z1 | 10.244.0.22  |
| runner_z1/0 (c0b0871b-c672-46c8-ac4a-1aabd81864f6)                        | unresponsive agent | n/a | runner_z1 | 10.244.0.26  |
| uaa_z1/0 (63b4bfa7-499d-4dba-93f6-2017b04a7588)                           | unresponsive agent | n/a | medium_z1 | 10.244.0.134 |
+---------------------------------------------------------------------------+--------------------+-----+-----------+--------------+



> bosh cck

Acting as user 'admin' on deployment 'cf-warden' on 'Bosh Lite Director'
Performing cloud check...

Director task 96
Error 100: Unable to get deployment lock, maybe a deployment is in progress. Try again later.

Task 96 error

For a more detailed error report, run: bosh task 96 --debug

> bosh locks

Acting as user 'admin' on 'Bosh Lite Director'

No locks

It is possible to do a 'reset' and get it up and running again using the commands below, but this takes quite some time and is surely a bigger 'hammer' than is required!

# in the bosh-lite directory
vagrant destroy && vagrant up

# in the cf-release directory
bosh upload release
bosh deploy

# back in the bosh-lite directory
bin/add-route
cf api --skip-ssl-validation https://api.bosh-lite.com
cf create-org my_org
cf create-space development -o my_org

Upvotes: 0

Views: 1358

Answers (3)

user5582395

Reputation: 1

It is recommended to pause the BOSH Lite VM when it's not in use, so that it can simply be resumed after the host goes to sleep or is rebooted. Otherwise the VM is halted by the OS and ends up in the aborted state. Running vagrant up on an aborted BOSH Lite VM gets it running again, but the CF VMs then go into an unresponsive state, which requires redeployment.

Running vagrant suspend to pause the VM and vagrant resume when you return to work avoids the situation with unresponsive CF VMs.
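The pause/resume routine could be wrapped in two small shell helpers, sketched below. The function names and the BOSH_LITE_DIR variable (with its default path) are assumptions made for illustration; only vagrant suspend and vagrant resume come from the answer itself.

```shell
# Sketch only. BOSH_LITE_DIR is a hypothetical variable pointing at your
# bosh-lite checkout (the directory that contains the Vagrantfile).
BOSH_LITE_DIR="${BOSH_LITE_DIR:-$HOME/workspace/bosh-lite}"

# Save the VM state to disk before closing the laptop or shutting down.
pause_bosh_lite() {
  (cd "$BOSH_LITE_DIR" && vagrant suspend)
}

# Restore the saved state when you come back to work.
resume_bosh_lite() {
  (cd "$BOSH_LITE_DIR" && vagrant resume)
}
```

Call pause_bosh_lite before the machine sleeps and resume_bosh_lite afterwards. The subshell keeps your current working directory unchanged.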

Upvotes: 0

hsiliev

Reputation: 66

I usually run vagrant suspend and later vagrant up to avoid ending up with dead containers/VMs inside BOSH Lite.

You can run bosh cck, but in my experience simply recreating the deployment is much faster and also more reliable.
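One way to recreate the deployment, assuming the same Ruby-based (v1) bosh CLI used in the question, is its --recreate option on deploy. This is a sketch rather than a confirmed recipe from the answer, and the function name is made up for illustration.

```shell
# Sketch only: assumes the v1 `bosh` CLI is already targeted at the BOSH Lite
# director and the deployment manifest is selected, as in the question.
recreate_cf_deployment() {
  # --recreate asks the director to rebuild every VM in the deployment
  # rather than repairing them individually through `bosh cck`.
  bosh deploy --recreate || return 1
  # List the VMs afterwards to check whether the agents respond again.
  bosh vms
}
```

This reuses the release and stemcell already uploaded to the director, so it should skip the vagrant destroy and bosh upload release steps of a full reset.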

Upvotes: 0

dkoper

Reputation: 1485

You can use sudo su after SSHing into the bosh-lite VM with vagrant ssh to become root; no root password is needed.
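For a one-off root command you may not need an interactive session at all. The helper below is a sketch that passes a command through vagrant ssh -c with sudo. The function name and BOSH_LITE_DIR (with its default path) are assumptions for illustration.

```shell
# Sketch only. BOSH_LITE_DIR is a hypothetical variable pointing at your
# bosh-lite checkout (the directory that contains the Vagrantfile).
BOSH_LITE_DIR="${BOSH_LITE_DIR:-$HOME/workspace/bosh-lite}"

# Run a single command as root inside the bosh-lite VM.
# Arguments are joined into one string, so keep them simple (no nested quotes).
run_as_root_in_bosh_lite() {
  (cd "$BOSH_LITE_DIR" && vagrant ssh -c "sudo $*")
}
```

For example, run_as_root_in_bosh_lite whoami is a quick way to confirm that passwordless sudo works inside the VM.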

BOSH Lite has always been hard to resurrect after a VM reboot or sleep.
To address this, someone recently (Dec 2016) wrote a utility to "gracefully put machines running BOSH Lite to sleep" and restore them on system wake: https://github.com/henryaj/ambient

Upvotes: 0
