Reputation: 90
Similar to this question, Ray Tune is reporting to me:
ValueError: The actor ImplicitFunc is too large (421 MiB > FUNCTION_SIZE_ERROR_THRESHOLD=95 MiB). Check that its definition is not implicitly capturing a large array or other object in scope. Tip: use ray.put() to put large objects in the Ray object store.
I have no idea what is being captured in my scope. It reports this seemingly no matter what changes I make. I have tried moving a dozen different references out of the function and into Ray's object store (ray.put() / ray.get()), and it barely moves the needle: taking out the model definition, the train/test data, and the folding function still resulted in 421 MiB. Which reference is >400 MiB?
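(For anyone debugging the same thing: the payload Ray is measuring can be approximated by pickling the trainable with Ray's bundled cloudpickle. A minimal sketch, assuming objective is the trainable defined below; the byte count is only an approximation of what Ray checks:)

import ray.cloudpickle as cloudpickle

# Serialize the trainable the way Ray does when shipping it to an actor,
# then report how large the pickled payload is.
payload = cloudpickle.dumps(objective)
print(f"objective pickles to {len(payload) / 1024 ** 2:.1f} MiB")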
Model Definition:
import ray
import torch
import torch.nn as nn
import torch.nn.functional as F
from ray import train
from ray.tune.search.optuna import OptunaSearch
from sklearn.model_selection import KFold

INPUT_DIM = tch_train.features.shape[1] - 1  # Remove one input column because the sample weight is included with the input data
OUTPUT_DIM = tch_train.labels.shape[1]

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(INPUT_DIM, OUTPUT_DIM)
    def forward(self, input):
        output = F.softmax(F.relu(self.fc1(input)), dim=1)
        return output
Main function:
K_FOLDS = 5
loss_function = nn.CrossEntropyLoss(reduction='none')
kfold = KFold(n_splits=K_FOLDS, shuffle=True)
fold_indices = [(train_ids, test_ids) for train_ids, test_ids in kfold.split(tch_train)]
fold_indices_ref = ray.put(fold_indices)
tch_train_ref = ray.put(tch_train)
# This function is the "Main stuff" of the machine learning.
# This will be called by RayTune and will be expected to train a machine learning model and report the results.
def objective(config):
    optimizer = torch.optim.SGD(  # Tune the optimizer
        model.parameters(), lr=config["lr"], momentum=config["momentum"]
    )
    # Make a model for each fold.
    fold_models = []
    for fold in range(K_FOLDS):
        fold_models.append(Net().to("cuda"))
    # Epoch loop
    while True:
        fold_losses = []
        for fold in range(K_FOLDS):
            train_ids, test_ids = ray.get(fold_indices_ref)[fold]
            # Take the epoch sample from the 4/1 train/test fold chunks.
            train_subsampler = torch.utils.data.SubsetRandomSampler(train_ids)
            test_subsampler = torch.utils.data.SubsetRandomSampler(test_ids)
            trainloader = torch.utils.data.DataLoader(ray.get(tch_train_ref), batch_size=config["batch_size"], sampler=train_subsampler)
            testloader = torch.utils.data.DataLoader(ray.get(tch_train_ref), batch_size=config["batch_size"], sampler=test_subsampler)
            # Iterate over the DataLoader for the training data
            for i, data in enumerate(trainloader, 0):
                # Get inputs; column 0 holds the sample weight
                features, targets = data
                inputs = features[:, 1:]
                sample_weights = features[:, 0]
                # Zero the gradients
                optimizer.zero_grad()
                # Perform forward pass
                outputs = fold_models[fold](inputs)
                # Compute the per-sample loss, scaled by the sample weights
                loss = loss_function(outputs, targets) * sample_weights
                # Perform backward pass
                loss.mean().backward()
                # Perform optimization
                optimizer.step()
            # Test on the held-out fold
            fold_losses.append(0.0)
            with torch.no_grad():
                # Iterate over the test data and accumulate the weighted loss
                for i, data in enumerate(testloader, 0):
                    # Get inputs; column 0 holds the sample weight
                    features, targets = data
                    inputs = features[:, 1:]
                    sample_weights = features[:, 0]
                    # Generate outputs with this fold's model
                    outputs = fold_models[fold](inputs)
                    # Add test loss; .item() keeps the accumulator a plain float
                    fold_losses[fold] += (loss_function(outputs, targets) * sample_weights).sum().item()
        # Report average fold losses
        train.report({"averaged_CEL": sum(fold_losses) / float(K_FOLDS)})  # Report to Tune
Tune Config:
search_space = {"lr": ray.tune.loguniform(1e-4, 1e-2), "momentum": ray.tune.uniform(0.1, 0.9)}
algo = OptunaSearch()
tuner = ray.tune.Tuner(
    objective,
    tune_config=ray.tune.TuneConfig(
        metric="averaged_CEL",
        mode="min",
        search_alg=algo,
    ),
    run_config=ray.train.RunConfig(
        stop={"training_iteration": 5},
    ),
    param_space=search_space,
)
results = tuner.fit()
print("Best config is:", results.get_best_result().config)
Upvotes: 0
Views: 83
Reputation: 90
I figured it out.
The lines:

optimizer = torch.optim.SGD(  # Tune the optimizer
    model.parameters(), lr=config["lr"], momentum=config["momentum"]
)

were not failing with a NameError because model happens to be a different, valid variable with a .parameters() method further up in the file; that model did not refer to the models I was training in this context. Because the trainable referenced it, that large outer-scope model was implicitly captured and serialized along with objective, which appears to be what pushed the actor past the size threshold.
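One way to restructure it (a minimal sketch, not my exact code: one optimizer is created per fold inside the trainable, so only that fold's own parameters are referenced; the epoch/fold loop body stays the same as in the question, using fold_optimizers[fold] in place of optimizer):

def objective(config):
    # Build one model and one matching optimizer per fold inside the
    # trainable, so no outer-scope model gets captured in the closure.
    fold_models = [Net().to("cuda") for _ in range(K_FOLDS)]
    fold_optimizers = [
        torch.optim.SGD(m.parameters(), lr=config["lr"], momentum=config["momentum"])
        for m in fold_models
    ]
    # ... epoch/fold loops as in the question, calling
    # fold_optimizers[fold].zero_grad() and fold_optimizers[fold].step()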
Upvotes: 0