Docker - how to use a saved file created in the container

Question

Objective: train a machine learning model in one script (train_model.py), save the model to a .joblib file (Inference_xgb.joblib), load the model in a second script (Inference.py), use it to make predictions, and save the output.

Issue: Inference.py cannot find the Inference_xgb.joblib file.

Relevant code snippets:

Training (train_model.py):

#!/usr/bin/python3

import pandas as pd
from xgboost import XGBClassifier
from joblib import dump

def train():
    # load in and read training data
    training = './train.csv'
    data_train = pd.read_csv(training)
    label = data_train['2020 Failure'] # what we want to predict
    features = data_train.drop(['2020 Failure', 'FACILITYID'], axis=1)  # columns the model learns from
    features = features.drop('Unnamed: 0', axis=1)
    x_train = features
    y_train = label

    # XGBoost model training
    xgb_model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
    xgb_model.fit(x_train, y_train)
    # save model
    dump(xgb_model, 'Inference_xgb.joblib')

if __name__ == '__main__':
    train()

Testing (Inference.py):

#!/usr/bin/python3

import pandas as pd
from joblib import load
from sklearn.metrics import confusion_matrix
import os

def inference():
    # load and read in test data
    testing = './test.csv'
    data_test = pd.read_csv(testing)

    label = data_test['2020 Failure'] # what we want to predict
    features = data_test.drop(['2020 Failure', 'FACILITYID'], axis=1)  # columns the model predicts from
    features = features.drop('Unnamed: 0', axis=1)
    IDS = data_test['FACILITYID']
    x_test = features
    y_test = label

    # run model
    xgb_model = load('Inference_xgb.joblib')
    y_label = xgb_model.predict(x_test)
    cm = confusion_matrix(y_test,y_label)
    print("Confusion Matrix: ")
    print(cm)

    # write results
    dirpath = os.getcwd()
    print('CURRENT PATH: ', dirpath)
    output_path = os.path.join(dirpath, 'output.csv')
    output_df = pd.DataFrame(y_label, columns=['Prediction'])
    output_df.insert(0, "FACILITYID", IDS.values)
    output_df.to_csv(output_path)
    print('OUTPUT DF')
    print(output_df)

if __name__ == "__main__":
    inference()

Dockerfile:

FROM jupyter/scipy-notebook 

RUN pip install joblib
RUN pip install xgboost==1.5.0

USER root

WORKDIR /scaleable-model

COPY train.csv ./train.csv
COPY test.csv ./test.csv

COPY train_model.py ./train_model.py
COPY inference.py ./inference.py

RUN python3 train_model.py

Comments, observations, and what I've tried:

I've noticed that removing WORKDIR /scaleable-model fixes the issue, but I want to keep WORKDIR set to /scaleable-model so I can mount the .csv output to my host machine.
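One thing that may be relevant to the mount: a host directory bind-mounted onto a container path hides whatever the image already has at that path. If the mount targets /scaleable-model itself, files written there during docker build would not be visible at run time. A minimal sketch of one way around that, assuming the output goes to its own subdirectory and only that subdirectory is mounted. The OUTPUT_DIR environment variable, the /scaleable-model/output default, and the get_output_path helper are hypothetical names, not part of the setup above:

```python
import os
from pathlib import Path


def get_output_path(filename="output.csv", default_dir="/scaleable-model/output"):
    """Return the path of the output file inside a dedicated output directory.

    OUTPUT_DIR is a hypothetical environment variable used for illustration;
    when it is not set, default_dir (also an assumption) is used instead.
    """
    out_dir = Path(os.environ.get("OUTPUT_DIR", default_dir))
    # Create the directory if it does not exist yet, so to_csv() can write into it.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / filename
```

In Inference.py, output_df.to_csv(get_output_path()) would then replace the os.getcwd() based path, and the mount could point at the output directory only, leaving the model file in /scaleable-model untouched.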

I am running docker build in the scaleable-model directory on my host machine. That is, I cd to /home/user/pathto/scaleable-model and run docker build -t scaleable-model -f Dockerfile .

I then call docker run and specify Inference.py as the script to run; this is when the error occurs.

I've tried hardcoded paths as well, but this did not help. I also created an Inference_xgb.joblib on my host machine in the same directory where I am building the container, but this did nothing either.

I suspect that either:

1. the Inference_xgb.joblib file is not being created properly in the container, or
2. I am messing up the directory structure somehow inside the container, and thus Inference.py cannot find the file.

To quote Michael Burry, "I guess when someone's wrong, they never know how". I'd like to try to understand the how here.

EDIT: Checking the contents of the container, the file (Inference_xgb.joblib) IS being created in the directory that I want (/scaleable-model). Therefore, it must be an issue with Inference.py not detecting the file for some reason.
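To narrow down where the lookup goes wrong, a small diagnostic could help. Since load('Inference_xgb.joblib') uses a relative path, it is resolved against the current working directory of the process, which is not necessarily the directory the script lives in. The sketch below builds the path from the script's own location instead and, if the file is missing, reports the working directory and what the target directory actually contains. The helper names resolve_model_path and report_missing are hypothetical, not part of the scripts above:

```python
from pathlib import Path


def resolve_model_path(filename="Inference_xgb.joblib", base_dir=None):
    """Return the absolute path of the model file inside base_dir.

    base_dir defaults to the directory containing this script, so the
    result does not depend on the current working directory.
    """
    if base_dir is None:
        base_dir = Path(__file__).resolve().parent
    return Path(base_dir).resolve() / filename


def report_missing(path):
    """Return a diagnostic message if path does not exist, otherwise None."""
    path = Path(path)
    if path.exists():
        return None
    parent = path.parent
    # List what is really in the directory, to spot name or case mismatches.
    siblings = sorted(p.name for p in parent.iterdir()) if parent.is_dir() else []
    return f"{path} not found; cwd={Path.cwd()}; {parent} contains {siblings}"
```

In Inference.py this could be used as model_path = resolve_model_path(), followed by printing report_missing(model_path) before calling load(model_path). The directory listing also makes it easy to spot differences in capitalization between file names, which matter on Linux.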
