Reputation: 144
I'm trying to use CfnCluster 1.2.1 for GPU computing and I'm using a custom AMI based on the Ubuntu 14.04 CfnCluster AMI.
Everything is created correctly in the CloudFormation console, but when I submit a new test task to Oracle Grid Engine using qsub from the master server, it never leaves the queue according to qstat. It always stays in state "qw" and never enters state "r".
It seems to work fine with the Amazon Linux AMI (using user ec2-user instead of ubuntu) and the exact same configuration. Also, the master instance announces the number of remaining tasks to the cluster as a metric, and new compute instances are auto-scaled as a result.
What mechanisms does CfnCluster or Oracle Grid Engine provide to further debug this? I took a look at the log files, but didn't find anything relevant. What could be the cause for this behavior?
Thank you,
Diego
Upvotes: 0
Views: 220
Reputation: 144
I think I found the solution. It seems to be the same issue as the one described in https://github.com/awslabs/cfncluster/issues/86#issuecomment-196966385
I fixed it by adding the following line to the CfnCluster configuration file:
base_os = ubuntu1404
If custom_ami is specified but base_os is not, base_os defaults to Amazon Linux, which uses a different method to configure SGE. If base_os does not match the operating system of the custom AMI, the SGE configuration performed by CfnCluster may not work.
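In case it helps someone else, here is a rough sketch of a check for this situation. It only looks for a custom_ami line without any base_os line anywhere in the file, so it does not understand sections or templates. The function name check_base_os is made up for this post, and the path in the usage comment is where I believe CfnCluster keeps its config by default; adjust it if yours lives elsewhere.

```shell
#!/bin/sh
# Hypothetical helper (not part of CfnCluster): warn when a config file
# sets custom_ami but has no base_os line at all.
# Simplification: the check is file-wide, not per [cluster ...] section.
check_base_os() {
    config_file="$1"
    if grep -Eq '^[[:space:]]*custom_ami[[:space:]]*=' "$config_file" &&
       ! grep -Eq '^[[:space:]]*base_os[[:space:]]*=' "$config_file"; then
        echo "warning: custom_ami is set but base_os is not"
        return 1
    fi
    echo "ok"
}

# Usage (path is an assumption about the default location):
#   check_base_os "$HOME/.cfncluster/config"
```

If it prints the warning, adding the base_os line shown above for the matching OS was what fixed it in my case.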
Upvotes: 0
Reputation: 3116
Similar to https://stackoverflow.com/a/37324418/704265
From your qhost output, it looks like your machine "ip-10-0-0-47" is properly configured in SGE. However, on "ip-10-0-0-47" sge_execd is either not running or not configured properly. If it were, qhost would report statistics for "ip-10-0-0-47".
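To spot this kind of host quickly, something like the following sketch may help. It reads qhost output on stdin and prints every host (other than the "global" pseudo-host) whose columns are all "-", which I am assuming is how a host with no reporting sge_execd shows up. The function name unreported_hosts is made up for this answer, and the exact qhost column layout varies between Grid Engine versions, so treat the parsing as an approximation.

```shell
#!/bin/sh
# Hypothetical helper: list hosts for which qhost reports no statistics.
# Assumes two header lines and "-" in every column after the hostname
# when the execution daemon on that host is not reporting.
unreported_hosts() {
    awk 'NR > 2 && $1 != "global" && NF > 1 {
        all_dash = 1
        for (i = 2; i <= NF; i++) {
            if ($i != "-") all_dash = 0
        }
        if (all_dash) print $1
    }'
}

# Usage on the master node:
#   qhost | unreported_hosts
# Then, on each host it prints, check whether the daemon is running:
#   ps -ef | grep sge_execd
# And ask the scheduler why a specific job is still pending:
#   qstat -j <job_id>
```

The qstat -j output usually includes a scheduling info section that explains why a job remains in "qw", which is often the fastest way to narrow things down.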
Upvotes: 1