Naga Budigam

Reputation: 759

Is there a way to log the descriptive stats of a dataset using MLflow?

Is there a way to log the descriptive stats of a dataset using MLflow? If so, could you please share the details?

Upvotes: 9

Views: 14956

Answers (4)

Seppo Enarvi

Reputation: 3663

MLflow has an experimental data API. The idea is that you derive your dataset class from mlflow.data.dataset.Dataset, or use one of the dataset classes it provides for common data storage formats. These classes may support downloading the data from a given source. For example:

import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset

dataset_source_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(dataset_source_url)
dataset = mlflow.data.from_pandas(df, source=dataset_source_url)

If you pass such a dataset to mlflow.log_input(), the function will log the dataset source, statistics, digest, etc. For example:

with mlflow.start_run():
    mlflow.log_input(dataset, context="training")

If you look at the log_input() source code, you can see that it converts the mlflow.data.dataset.Dataset to an mlflow.entities.Dataset object, combines it with any tags to create an mlflow.entities.DatasetInput object, and then uses MlflowClient.log_inputs() to log the values. mlflow.entities.Dataset is just a simple data structure that contains the textual values to be logged. If integrating the data API into your code base seems too complicated, you can call MlflowClient.log_inputs() directly in the same way.

Upvotes: 4

Maciej Skorski

Reputation: 3384

As the other answers point out, MLflow lets you upload arbitrary local files. Good practice, however, is to write to and upload from temporary files.

The advantages over the accepted answer are: no leftover files, and no conflicts when runs execute in parallel.

import os, tempfile
import numpy as np
import mlflow

with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, 'bits_corr_matrix.csv')
    np.savetxt(fname, corr_matrix, delimiter=',')  # corr_matrix: a NumPy array computed earlier
    mlflow.log_artifact(fname)

Upvotes: 1

Adrien Pacifico

Reputation: 2019

There is also the possibility to log the artifact as an HTML file, so that it is displayed as an (ugly) table in the MLflow UI.

import seaborn as sns
import mlflow

df_iris = sns.load_dataset("iris")
df_iris.describe().to_html("iris.html")
with mlflow.start_run():
    mlflow.log_artifact("iris.html", "stat_descriptive")


Upvotes: 6

Raphael K

Reputation: 2353

Generally speaking, you can log arbitrary output from your code using the mlflow.log_artifact() function. From the docs:

mlflow.log_artifact(local_path, artifact_path=None)
Log a local file or directory as an artifact of the currently active run.

Parameters:
local_path – Path to the file to write.
artifact_path – If provided, the directory in artifact_uri to write to.

As an example, say you have your statistics in a pandas dataframe, stat_df.

## Write csv from stats dataframe
stat_df.to_csv('dataset_statistics.csv')

## Log CSV to MLflow
mlflow.log_artifact('dataset_statistics.csv')

This will show up under the artifacts section of this MLflow run in the Tracking UI. If you explore the docs further you'll see that you can also log an entire directory and the objects therein. In general, MLflow provides you a lot of flexibility - anything you write to your file system you can track with MLflow. Of course that doesn't mean you should. :)

Upvotes: 17
