Reputation: 395
I am developing a system that needs to train dozens of individual models (>50) using Lightning, each with its own TensorBoard plots and logs. My current implementation uses one Trainer object per model, and I run into the error below once I create more than ~90 Trainer objects. Interestingly, the error only appears when I call the .test() method, not during .fit():
Traceback (most recent call last):
  File "lightning/main_2.py", line 193, in <module>
    main()
  File "lightning/main_2.py", line 174, in main
    new_trainer.test(model=new_model, test_dataloaders=te_loader)
  File "\Anaconda3\envs\pyenv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1279, in test
    results = self.__test_given_model(model, test_dataloaders)
  File "\Anaconda3\envs\pyenv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1343, in __test_given_model
    self.set_random_port(force=True)
  File "\Anaconda3\envs\pyenv\lib\site-packages\pytorch_lightning\trainer\distrib_data_parallel.py", line 398, in set_random_port
    default_port = RANDOM_PORTS[-1]
IndexError: index -1 is out of bounds for axis 0 with size 0
As I just started with Lightning, I am not sure whether having one Trainer per model is the best approach. However, I need individual plots from each model, and it seems that if I use a single Trainer for multiple models the results get overwritten.
For reference, I'm defining the different lists of trainers as follows:
for i in range(args["num_users"]):
    trainer_list_0.append(Trainer(max_epochs=args["epochs"], gpus=1, default_root_dir=args["save_path"],
                                  fast_dev_run=args["fast_dev_run"], weights_summary=None))
    trainer_list_1.append(Trainer(max_epochs=args["epochs"], gpus=1, default_root_dir=args["save_path"],
                                  fast_dev_run=args["fast_dev_run"], weights_summary=None))
    trainer_list_2.append(Trainer(max_epochs=args["epochs"], gpus=1, default_root_dir=args["save_path"],
                                  fast_dev_run=args["fast_dev_run"], weights_summary=None))
As for training:
for i in range(args["num_users"]):
    trainer_list_0[i].fit(model_list_0[i], train_dataloader=dataloader_list[i],
                          val_dataloaders=val_loader)
    trainer_list_1[i].fit(model_list_1[i], train_dataloader=dataloader_list[i],
                          val_dataloaders=val_loader)
    trainer_list_2[i].fit(model_list_2[i], train_dataloader=dataloader_list[i],
                          val_dataloaders=val_loader)
And testing:
for i in range(args["num_users"]):
    trainer_list_0[i].test(model=model_list_0[i], test_dataloaders=te_loader)
    trainer_list_1[i].test(model=model_list_1[i], test_dataloaders=te_loader)
    trainer_list_2[i].test(model=model_list_2[i], test_dataloaders=te_loader)
Thanks!
Upvotes: 1
Views: 2273
Reputation: 1091
As far as I know, only one model per Trainer is expected. You can explicitly pass a TensorBoardLogger object to the Trainer with a pre-defined experiment name and version so as to keep the plots separate (see the docs):
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger
logger = TensorBoardLogger("tb_logs", name="my_model", version="version_XX")
trainer = Trainer(logger=logger)
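As a rough sketch of how that could look for your setup (the names args["num_users"], args["epochs"], args["save_path"], args["fast_dev_run"], model_list_0, dataloader_list, val_loader and te_loader are taken from your snippets; the experiment/version naming is just an illustration), you can create a fresh logger per model so each one writes to its own TensorBoard subdirectory:
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

for i in range(args["num_users"]):
    # One logger per model: separate name/version keeps the plots from overwriting each other.
    logger = TensorBoardLogger(args["save_path"], name="model_list_0", version=f"user_{i}")
    trainer = Trainer(max_epochs=args["epochs"], gpus=1, logger=logger,
                      fast_dev_run=args["fast_dev_run"], weights_summary=None)
    trainer.fit(model_list_0[i], train_dataloader=dataloader_list[i], val_dataloaders=val_loader)
    trainer.test(model=model_list_0[i], test_dataloaders=te_loader)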
The problem you're facing is related to the ddp module. Its source code contains the following lines [1], [2]:
RANDOM_PORTS = RNG1.randint(10000, 19999, 1000)

def set_random_port(self, force=False):
    ...
    default_port = RANDOM_PORTS[-1]
    RANDOM_PORTS = RANDOM_PORTS[:-1]

    if not force:
        default_port = os.environ.get('MASTER_PORT', default_port)
Each call to set_random_port pops one port from that pre-generated pool of 1000 ports, so once the pool is exhausted, RANDOM_PORTS[-1] raises the IndexError you see. I'm not sure why you hit that limit with only 90+ Trainers, but as a workaround you could try dropping this line:
RANDOM_PORTS = RANDOM_PORTS[:-1]
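If you'd rather not edit the installed package, a rough alternative is to monkey-patch the method so it never consumes the shared pool. This is an untested sketch that assumes the pytorch_lightning version from your traceback, where set_random_port is a method on Trainer and ends by setting the MASTER_PORT environment variable; adjust it to whatever your version actually does:
import os
import random

from pytorch_lightning import Trainer

def _patched_set_random_port(self, force=False):
    # Draw a fresh random port on every call instead of popping from the
    # finite RANDOM_PORTS pool, so the pool can never run empty.
    default_port = random.randint(10000, 19999)
    if not force:
        default_port = os.environ.get('MASTER_PORT', default_port)
    os.environ['MASTER_PORT'] = str(default_port)

# Assumption: in your version set_random_port is a plain method on Trainer,
# as the traceback suggests.
Trainer.set_random_port = _patched_set_random_port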
Upvotes: 3