shamalaia

Reputation: 2351

MLflow artifact logging and retrieval on a remote server

I am trying to set up an MLflow tracking server on a remote machine as a systemd service. I have an SFTP server running and have created an SSH key pair.

Everything seems to work fine except the artifact logging. MLflow does not seem to have permission to list the artifacts saved in the mlruns directory.

I create an experiment and log artifacts in this way:

import mlflow

uri = 'http://192.XXX:8000'
mlflow.set_tracking_uri(uri)

# Store this experiment's artifacts on the remote machine over SFTP
mlflow.create_experiment('test', artifact_location='sftp://192.XXX:_path_to_mlruns_folder_')

experiment = mlflow.get_experiment_by_name('test')
with mlflow.start_run(experiment_id=experiment.experiment_id, run_name=run_name) as run:
    mlflow.log_param(_parameter_name_, _parameter_value_)
    mlflow.log_artifact(_an_artifact_, _artifact_folder_name_)

I can see the metrics in the UI and the artifacts in the correct destination folder on the remote machine. However, in the UI I receive this message when trying to see the artifacts:

Unable to list artifacts stored under sftp://192.XXX:path_to_mlruns_folder/run_id/artifacts for the current run. Please contact your tracking server administrator to notify them of this error, which can happen when the tracking server lacks permission to list artifacts under the current run's root artifact directory.

I cannot figure out why, since the mlruns folder has drwxrwxrwx permissions and all of its subfolders have drwxrwxr-x. What am I missing?


UPDATE: Looking at it with fresh eyes, it seems odd that the server tries to list files through sftp://192.XXX: when it could just look in the local folder _path_to_mlruns_folder_/_run_id_/artifacts. However, I still do not know how to work around that.
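
To make it easier to see what the tracking server is being asked to do, here is a minimal sketch that splits an sftp:// artifact URI into its parts using only the standard library. The function name describe_sftp_uri, the host example-host and the users alice and root are hypothetical and used for illustration only. The fallback to the process user when the URI carries no user is an assumption about how an SFTP client typically behaves, not a statement about MLflow's internals.

```python
from urllib.parse import urlparse


def describe_sftp_uri(uri, process_user):
    """Split an sftp:// artifact URI and report which login would be used.

    process_user is the account the calling process runs as. It is passed
    in explicitly so the fallback is visible (assumption: a client falls
    back to this account when the URI names no user).
    """
    parsed = urlparse(uri)
    if parsed.scheme != "sftp":
        raise ValueError(f"not an sftp URI: {uri!r}")
    return {
        "host": parsed.hostname,
        "path": parsed.path,
        # Use the user embedded in the URI if present, else the fallback.
        "login_user": parsed.username or process_user,
        "user_from_uri": parsed.username is not None,
    }


# Hypothetical values: a URI without a user, read by a process running as root.
info = describe_sftp_uri("sftp://example-host/srv/mlruns", process_user="root")
print(info["login_user"], info["user_from_uri"])
```

If the login user reported here is not an account that holds a usable SSH key for the remote machine, listing the artifacts over SFTP can fail even though the directory permissions themselves look fine.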

Upvotes: 3

Views: 4148

Answers (1)

shamalaia

Reputation: 2351

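The problem seems to be that, by default, a systemd service runs as root, so the tracking server was connecting over SFTP as a user without a usable SSH key. Specifying a user in the unit file and creating an SSH key pair for that user to access the same remote machine fixed it: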

[Unit]
Description=MLflow server
After=network.target

[Service]
Restart=on-failure
RestartSec=20
User=_user_
Group=_group_
ExecStart=/bin/bash -c 'PATH=_yourpath_/anaconda3/envs/mlflow_server/bin/:$PATH exec mlflow server --backend-store-uri postgresql://mlflow:mlflow@localhost/mlflow --default-artifact-root sftp://[email protected]:_yourotherpath_/MLFLOW_SERVER/mlruns -h 0.0.0.0 -p 8000'

[Install]
WantedBy=multi-user.target

_user_ and _group_ should match the owner and group that ls -la shows for the mlruns directory.
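
The same check can be scripted. Below is a minimal sketch, assuming a POSIX system, that reads the owner and group of a directory the way ls -la reports them and compares them with the values intended for User= and Group=. The helper names owner_and_group and matches_service_account are hypothetical, and the path and account names in the usage comment are placeholders.

```python
import grp
import os
import pwd


def owner_and_group(path):
    """Return (owner, group) names for path, as ls -la would show them.

    Falls back to the numeric id as a string when the id has no entry in
    the passwd or group database.
    """
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)
    return owner, group


def matches_service_account(path, user, group):
    """True if path is owned by the user and group given in the unit file."""
    return owner_and_group(path) == (user, group)


# Usage (placeholders, replace with your own values):
# matches_service_account("_yourotherpath_/MLFLOW_SERVER/mlruns", "_user_", "_group_")
```

Running this against the mlruns directory before starting the service is a quick way to catch a mismatch between the unit file and the directory's ownership.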

Upvotes: 2
