Neeraj Kumar

Reputation: 1036

Kubernetes and GPU node cluster implementation practices

I am trying to understand K8s GPU practices better, and am implementing a small K8s GPU cluster that is supposed to work as described below.

This is going to be a somewhat long explanation, but I hope it helps to have many questions in one place for understanding GPU practices in Kubernetes better.

Application Requirement

Cluster Requirement

If no message is present in the queue and none of the GPU-based pods is executing a program (i.e., not using the GPU), then the GPU node pool should scale down to 0.

Design 1

Create a GPU node pool. Each node contains N GPUs, where N >= 1. Assign a model trainer pod to each GPU, i.e., a 1:1 mapping of pods to GPUs. I tried assigning 2 pods to a 2-GPU machine, where each pod is supposed to run an MNIST program.
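For reference, a pod spec along these lines gives the 1:1 mapping (this is a sketch rather than my exact manifest; the pod name is illustrative, and the nvidia.com/gpu resource assumes the NVIDIA device plugin / GKE GPU drivers are installed on the node):

    # Sketch: request exactly one GPU per trainer pod (1:1 pod-to-GPU mapping).
    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: mnist-trainer-0        # illustrative name; create one such pod per GPU
    spec:
      restartPolicy: OnFailure     # suitable for a run-once training container
      containers:
      - name: mnist
        image: nkumar15/mnist
        resources:
          limits:
            nvidia.com/gpu: 1      # whole GPUs only; fractional GPUs / per-GPU memory limits are not supported
    EOF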

What I noticed is

1 pod got allocated a GPU, executed the program, and later went into a crash loop. Maybe I am making some mistake, as my Docker image is supposed to run the program only once and I was just doing a feasibility test of running 2 pods simultaneously on the 2 GPUs of the same node. Below is the error:

    Message                                     Reason   First Seen                Last Seen                 Count
    Back-off restarting failed container        BackOff  Jun 21, 2018, 3:18:15 PM  Jun 21, 2018, 4:16:42 PM  143
    pulling image "nkumar15/mnist"              Pulling  Jun 21, 2018, 3:11:33 PM  Jun 21, 2018, 3:24:52 PM  5
    Successfully pulled image "nkumar15/mnist"  Pulled   Jun 21, 2018, 3:12:46 PM  Jun 21, 2018, 3:24:52 PM  5
    Created container                           Created  Jun 21, 2018, 3:12:46 PM  Jun 21, 2018, 3:24:52 PM  5
    Started container                           Started  Jun 21, 2018, 3:12:46 PM  Jun 21, 2018, 3:24:52 PM  5

The other pod didn't get assigned to a GPU at all. Below is the message from the pod's events:

0/3 nodes are available: 3 Insufficient nvidia.com/gpu.
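A useful sanity check here (a sketch, assuming plain kubectl; the column expression is just one way to read the allocatable field) is whether the nodes actually advertise the nvidia.com/gpu resource to the scheduler, since pods requesting GPUs stay unschedulable until the device plugin (the driver installer DaemonSet on GKE) reports it:

    # Sketch: list how many GPUs each node reports as allocatable.
    # An empty/zero GPU column means the device plugin is not (yet) running on that node.
    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

    # Alternatively, inspect a single node in detail:
    kubectl describe node <gpu-node-name> | grep -A 10 "Allocatable"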

Design 2

Have multiple GPU machines in the GPU node pool, with each node having only 1 GPU.
K8s will assign each pod to a node with an available GPU, and hopefully there won't be any issue. I am yet to try this.

Questions

  1. Is there any suggested practice for designing this type of system in Kubernetes as of version 1.10?
  2. Is the Design 1 approach not feasible as of the 1.10 release? For example, if I have a 2-GPU node with 24 GB of GPU memory, is it possible for K8s to assign 1 pod to each GPU, with each pod executing its own workload under a 12 GB memory limit?
  3. How do I scale the GPU node pool down to size 0 through the autoscaler?
  4. In Design 2, what if I run out of GPU memory? Currently in GCP a single GPU node doesn't have more than 16 GB of GPU memory.

Again, apologies for such a long question, but I hope it will help others as well.

Updates

For question 2, I created a new cluster to reproduce the same issue I had faced multiple times before. I am not sure what changed this time, but the 2nd pod was successfully allocated a GPU. I think with this result I can confirm that 1-GPU-to-1-pod mapping is allowed on a multi-GPU single node. However, restricting memory per GPU process is not feasible as of 1.10.

Upvotes: 2

Views: 1790

Answers (1)

user571470

Reputation: 404

Both designs are supported in 1.10. I view Design 2 as a special case of Design 1; you don't necessarily need to have 1 GPU per node. If your pod needs more GPUs and memory, you have to have multiple GPUs per node, as you mentioned in question (4). I'd go with Design 1 unless there's a reason not to.

I think the best practice would be to create a new cluster with no GPUs (a cluster has a default node pool), and then create a GPU node pool and attach it to the cluster. Your non-GPU workload can run in the default pool, and the GPU workload can run in the GPU pool. To support scaling down to 0 GPU nodes, you need to set --num-nodes and --min-nodes to 0 when creating the GPU node pool, roughly as in the sketch below.
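A sketch of what that could look like with gcloud (cluster/pool names, zone, accelerator type, and the max node count are placeholders, and exact flags can vary by gcloud version):

    # Sketch: CPU-only default pool plus a GPU pool that can autoscale down to 0.
    gcloud container clusters create my-cluster \
      --zone us-central1-a --num-nodes 3

    gcloud container node-pools create gpu-pool \
      --cluster my-cluster --zone us-central1-a \
      --accelerator type=nvidia-tesla-k80,count=1 \
      --num-nodes 0 \
      --enable-autoscaling --min-nodes 0 --max-nodes 4

    # Note: on GKE the NVIDIA driver installer DaemonSet must also be deployed
    # before the GPUs become allocatable; see the GPU docs linked below.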

Docs:

Create a cluster with no GPUs: https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-cluster#creating_a_cluster

Create a GPU node pool for an existing cluster: https://cloud.google.com/kubernetes-engine/docs/concepts/gpus#gpu_pool

Upvotes: 1
