Frido
Frido

Reputation: 101

Weights and Biases error: The wandb backend process has shutdown

running the colab linked below, I get the following error:

"The wandb backend process has shutdown"

I see nothing suspicious in the way the colab uses wandb and I couldn't find anyone with the same problem. Any help is greatly appreciated. I am using the latest version of wandb in colab.

This is where I set up wandb:

if WANDB:
    wandb.login()

and this is the part where I get the error:

#setup wandb if we're using it

if WANDB:
    experiment_name = os.environ.get("EXPERIMENT_NAME")
    group = experiment_name if experiment_name != "none" else wandb.util.generate_id()

cv_scores = []
oof_data_frame = pd.DataFrame()
for fold in range(1, config.folds + 1):
    print(f"Fold {fold}/{config.folds}", end="\n"*2)
    fold_directory = os.path.join(config.output_directory, f"fold_{fold}")    
    make_directory(fold_directory)
    model_path = os.path.join(fold_directory, "model.pth")
    model_config_path = os.path.join(fold_directory, "model_config.json")
    checkpoints_directory = os.path.join(fold_directory, "checkpoints/")
    make_directory(checkpoints_directory)
    
    #Data collators are objects that will form a batch by using a list of dataset elements as input.
    collator = Collator(tokenizer=tokenizer, max_length=config.max_length)
    
    train_fold = train[~train["fold"].isin([fold])]
    train_dataset = Dataset(texts=train_fold["anchor"].values, 
                            pair_texts=train_fold["target"].values,
                            contexts=train_fold["title"].values,
                            targets=train_fold["score"].values, 
                            max_length=config.max_length,
                            sep=tokenizer.sep_token,
                            tokenizer=tokenizer)
    
    train_loader = DataLoader(dataset=train_dataset, 
                              batch_size=config.batch_size, 
                              num_workers=config.num_workers,
                              pin_memory=config.pin_memory,
                              collate_fn=collator,
                              shuffle=True, 
                              drop_last=False)
    
    print(f"Train samples: {len(train_dataset)}")
    
    validation_fold = train[train["fold"].isin([fold])]
    validation_dataset = Dataset(texts=validation_fold["anchor"].values, 
                                 pair_texts=validation_fold["target"].values,
                                 contexts=validation_fold["title"].values,
                                 targets=validation_fold["score"].values,
                                 max_length=config.max_length,
                                 sep=tokenizer.sep_token,
                                 tokenizer=tokenizer)
    
    validation_loader = DataLoader(dataset=validation_dataset, 
                                   batch_size=config.batch_size*2, 
                                   num_workers=config.num_workers,
                                   pin_memory=config.pin_memory,
                                   collate_fn=collator,
                                   shuffle=True, 
                                   drop_last=False)
    
    print(f"Validation samples: {len(validation_dataset)}")


    model = Model(**config.model)
    
    if not os.path.exists(model_config_path): 
        model.config.to_json_file(model_config_path)
    
    model_parameters = model.parameters()
    optimizer = get_optimizer(**config.optimizer, model_parameters=model_parameters)
    
    training_steps = len(train_loader) * config.epochs
    
    if "scheduler" in config:
        config.scheduler.parameters.num_training_steps = training_steps
        config.scheduler.parameters.num_warmup_steps = training_steps * config.get("warmup", 0)
        scheduler = get_scheduler(**config.scheduler, optimizer=optimizer, from_transformers=True)
    else:
        scheduler = None
        
    model_checkpoint = ModelCheckpoint(mode="min", 
                                       delta=config.delta, 
                                       directory=checkpoints_directory, 
                                       overwriting=True, 
                                       filename_format="checkpoint.pth", 
                                       num_candidates=1)


    if WANDB:
        wandb.init()
        #wandb.init(group=group, name=f"fold_{fold}", config=config)
    
    (train_loss, train_metrics), (validation_loss, validation_metrics, validation_outputs) = training_loop(model=model, 
                                                                                                           optimizer=optimizer, 
                                                                                                           scheduler=scheduler,
                                                                                                           scheduling_after=config.scheduling_after,
                                                                                                           train_loader=train_loader,
                                                                                                           validation_loader=validation_loader,
                                                                                                           epochs=config.epochs, 
                                                                                                           gradient_accumulation_steps=config.gradient_accumulation_steps, 
                                                                                                           gradient_scaling=config.gradient_scaling, 
                                                                                                           gradient_norm=config.gradient_norm, 
                                                                                                           validation_steps=config.validation_steps, 
                                                                                                           amp=config.amp,
                                                                                                           debug=config.debug, 
                                                                                                           verbose=config.verbose, 
                                                                                                           device=config.device, 
                                                                                                           recalculate_metrics_at_end=True, 
                                                                                                           return_validation_outputs=True, 
                                                                                                           logger="tqdm")
    
    if WANDB:
        wandb.finish()
    
    if config.save_model:
        model_state = model.state_dict()
        torch.save(model_state, model_path)
        print(f"Model's path: {model_path}")
    
    validation_fold["prediction"] = validation_outputs.to("cpu").numpy()
    oof_data_frame = pd.concat([oof_data_frame, validation_fold])
    
    cv_monitor_value = validation_loss if config.cv_monitor_value == "loss" else validation_metrics[config.cv_monitor_value]
    cv_scores.append(cv_monitor_value)
    
    del model, optimizer, validation_outputs, train_fold, validation_fold
    torch.cuda.empty_cache()
    gc.collect()
    
    print(end="\n"*6)

Upvotes: 8

Views: 6070

Answers (2)

Warkaz
Warkaz

Reputation: 903

TDLR; Check if the generated id is unique in the project space of wandb you are using.

Explanation

You can check the exact reason this happened in the log files under the wandb folder and specific run id. Had the same issue with Error communicating with wandb process and The wandb backend process has shutdown.

My problem was that I was assigning the run id to a specific instance which already existed, and rerunning the whole search space, but the run id have to be unique. Using name in init is generally a safer bet if you don't intend to continue the previous run (which is possible if you indicate so in the init method).

You can try running Wandb in offline mode, to see if this can help, and later on doing wandb sync.

Upvotes: 1

Vishwas
Vishwas

Reputation: 1

Solution which worked for me is run !wandb login --relogin.

Upvotes: 0

Related Questions