Peter Jung

Reputation: 23

GCP - Cannot SSH into fresh GPU Deep Learning VM instance

If I create a fresh GCE VM instance with a GPU and a GPU-optimized Debian image, I cannot SSH into it, either via the browser SSH window or with a third-party SSH client (after uploading my public key).

I have tried the suggestion here but it did not help.

If I create the instance without a GPU and with a standard Ubuntu image, everything works fine out of the box.

Is there something I am missing about GPU Deep Learning instances?

Edit:

gcloud command to reproduce the problem:

gcloud beta compute --project=avid-compound-233309 instances create instance-1 --zone=us-central1-a --machine-type=n1-standard-1 --subnet=default --network-tier=PREMIUM --maintenance-policy=TERMINATE --service-account=105060870131-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --accelerator=type=nvidia-tesla-k80,count=1 --image=c0-common-gce-gpu-image-20191213 --image-project=ml-images --boot-disk-size=50GB --boot-disk-type=pd-standard --boot-disk-device-name=instance-1 --reservation-affinity=any

And yes, it happens right after the VM is created. The Serial Port 1 log contains a long list of errors; a short excerpt:

[    9.393769] google_accounts_daemon[692]:     File "<frozen importlib._bootstrap>", line 574, in module_from_spec
[    9.394022] google_accounts_daemon[692]:   AttributeError: 'NoneType' object has no attribute 'loader'
[    9.394250] google_accounts_daemon[692]: Remainder of file ignored
[    9.394504] google_accounts_daemon[692]: Traceback (most recent call last):
[    9.394767] google_accounts_daemon[692]:   File "/usr/bin/google_accounts_daemon", line 6, in <module>
[    9.395108] google_accounts_daemon[692]:     from pkg_resources import load_entry_point
[    9.395344] google_accounts_daemon[692]:   File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 57, in <module>
[    9.395502] google_accounts_daemon[692]:     from pkg_resources.extern import six
[    9.395719] google_accounts_daemon[692]: ImportError: No module named 'pkg_resources.extern'
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     File "/usr/lib/python3.5/site.py", line 173, in addpackage
Dec 23 19:40:05 localhost google_accounts_daemon[692]:       exec(line)
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     File "<string>", line 1, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     File "<frozen importlib._bootstrap>", line 574, in module_from_spec
Dec 23 19:40:05 localhost google_accounts_daemon[692]:   AttributeError: 'NoneType' object has no attribute 'loader'
Dec 23 19:40:05 localhost google_accounts_daemon[692]: Remainder of file ignored
Dec 23 19:40:05 localhost google_accounts_daemon[692]: Traceback (most recent call last):
Dec 23 19:40:05 localhost google_accounts_daemon[692]:   File "/usr/bin/google_accounts_daemon", line 6, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     from pkg_resources import load_entry_point
Dec 23 19:40:05 localhost google_accounts_daemon[692]:   File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 57, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     from pkg_resources.extern import six
Dec 23 19:40:05 localhost google_accounts_daemon[692]: ImportError: No module named 'pkg_resources.extern'

Upvotes: 1

Views: 429

Answers (1)

mebius99

Reputation: 2605

It seems that the freshly published image "GPU Optimized Debian m32 (with CUDA 10.0) (c0-common-gce-gpu-image-20191213)" contains a damaged ext4 filesystem. Directories, configuration files, and scripts contain garbage, so the initial configuration at first boot fails:

Started Flush Journal to Persistent Storage.
Starting Create Volatile Files and Directories...
[ 4.880071] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 144, inode_bitmap = 4718608
[ 4.883559] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 145, inode_bitmap = 4718609
[ 4.887054] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 146, inode_bitmap = 4718610
...
localhost ssh-generate-hostkeys[485]: /etc/ssh/ssh_host_ecdsa_key.pub is not a public key file.
localhost dhclient[516]: 
localhost ssh-generate-hostkeys[485]: /etc/ssh/ssh_host_ed25519_key.pub is not a public key file.
localhost ssh-generate-hostk[  OK  ] Started Getty on tty1.
...
keys[485]: /etc/ssh/ssh_host_rsa_key.pub is not a public key file.
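
To check whether another instance shows the same symptoms, the serial console output can be fetched without SSH access and filtered for the messages quoted above. This is only a minimal sketch: `scan_serial_log` is a hypothetical helper name, and the instance name and zone in the usage comment are taken from the question and would need adjusting.

```shell
# Hypothetical helper: read a serial console log on stdin and keep only the
# lines that point to filesystem corruption, broken SSH host keys, or the
# Python import failure seen in the question.
scan_serial_log() {
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -E 'EXT4-fs error|is not a public key file|ImportError' || true
}

# Usage (assumes instance-1 in us-central1-a, as in the question):
# gcloud compute instances get-serial-port-output instance-1 \
#   --zone=us-central1-a | scan_serial_log
```

If the filter prints EXT4-fs errors from the very first boot, the problem most likely lies in the image itself and not in the SSH key or firewall setup.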

A public issue was recently opened for this in the Public Issue Tracker: https://issuetracker.google.com/146807209

It should be fixed soon.

Upvotes: 1
