deploying mongodb on a batch queued cluster, persistent data but hostnames will change each subsequent job

Question

I'm trying to run a sharded MongoDB cluster ontop of a HPC cluster that jobs are submitted through batch job queuing and a shared file system across all compute nodes. Every time a job is submitted, there is no guarantee which hosts the code will run on.

I'm running a data analysis on a massively parallel super computer. The analysis benefits from having its data hosted in a queryable database like mongodb, that is, a distributed datastore. I would like to use a portion of the assigned compute nodes to run the shareded mongodb cluster, and the rest to run the analysis.

There will be at least 2 batch jobs to this process.

1) run mongo config serve, shards, and routers on a subset of assigned compute nodes. Then run ingest scripts to load data on the remaining compute nodes concurrently. When this job is finished, it will terminate all the mongo daemons and release the compute resources back to the system.

2) request another job to run mongo config, shards, and routers on a subset of assigned computes nodes. Then on the remaining compute nodes the analysis code will run concurrently, querying the adjoined Mongodb cluster.

The hope is that I only have to run 1) once, but then I can run 2) with whatever crazy analysis I dream up.

At the moment I have 1) running well. However, the hostnames get saved in rs.conf(). For 2), when I try to restart the cluster in a subsequent job I get a lot of

2019-02-18T13:32:38.946-0600 I NETWORK  [Balancer] Cannot reach any nodes for set monitoringMetricsShard3. Ple
ase check network connectivity and the status of the set. This has happened for 10 checks in a row.
2019-02-18T13:32:39.457-0600 W NETWORK  [Balancer] Unable to reach primary for set monitoringMetricsShard3
2019-02-18T13:32:39.457-0600 I NETWORK  [Balancer] Cannot reach any nodes for set monitoringMetricsShard3. Ple
ase check network connectivity and the status of the set. This has happened for 11 checks in a row.
2019-02-18T13:32:39.968-0600 W NETWORK  [Balancer] Unable to reach primary for set monitoringMetricsShard3
2019-02-18T13:32:40.480-0600 W NETWORK  [Balancer] Unable to reach primary for set monitoringMetricsShard3

I found this, MongoDB Replica Set Member State is "OTHER" Which helps fix the config server. But i am only able to mongo shell into the config server and mongos router. mongo shelling into the shards gets refused.

TL;DR I want to keep the shared data of a mongodb cluster persistently saved in the file system, but the daemons will be routinely restarted on different hosts.

deploying mongodb on a batch queued cluster, persistent data but hostnames will change each subsequent job

Answers (1)

Related Questions