BlueMango

Reputation: 503

Saving data with DataCatalog

I was looking at the iris project example provided by Kedro. Apart from logging the accuracy, I also wanted to save the predictions and test_y as a CSV.

This is the example node provided by kedro.

import logging

import numpy as np
import pandas as pd


def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> None:
    """Node for reporting the accuracy of the predictions performed by the
    previous node. Notice that this function has no outputs, except logging.
    """
    # Get the true class index
    target = np.argmax(test_y.to_numpy(), axis=1)
    # Calculate the accuracy of the predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)

I added the following to save the data.

# Import path for Kedro 0.16+; adjust for your Kedro version
from kedro.extras.datasets.pandas import CSVDataSet

data = pd.DataFrame({"target": target, "prediction": predictions})
data_set = CSVDataSet(filepath="data/test.csv")
data_set.save(data)

This works as intended; however, my question is: is this the Kedro way of doing things? Can I declare the dataset in catalog.yml and later save data to it? If so, how do I access that dataset from catalog.yml inside a node?

Is there a way to save data without creating a dataset inside the node like data_set = CSVDataSet(filepath="data/test.csv")? I would like to define it in catalog.yml, if possible and if that follows Kedro conventions.

Upvotes: 0

Views: 1267

Answers (1)

datajoely

Reputation: 1516

Kedro actually abstracts this part for you. You don't need to access the datasets via their Python API.

Your report_accuracy method does need to be tweaked to return the DataFrame instead of None.
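
A minimal sketch of that tweak, reusing the logging from the Kedro example and the DataFrame construction from your own snippet (returning the DataFrame is what makes it a node output the catalog can save):

import logging

import numpy as np
import pandas as pd


def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> pd.DataFrame:
    """Log the accuracy and return the predictions for Kedro to save."""
    # Get the true class index
    target = np.argmax(test_y.to_numpy(), axis=1)
    # Calculate and log the accuracy of the predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)
    # The returned DataFrame becomes the node's output dataset
    return pd.DataFrame({"target": target, "prediction": predictions})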

Your node needs to be defined as such (report_accuracy takes two arguments, so the node lists two input datasets):

node(
    func=report_accuracy,
    inputs=["dataset_a", "dataset_b"],
    outputs="dataset_c",
)

Kedro then looks at your catalog and will load dataset_a and dataset_b, and save dataset_c, as required:

dataset_a:
  type: pandas.CSVDataSet
  filepath: xxxx.csv

dataset_b:
  type: pandas.CSVDataSet
  filepath: yyyy.csv

dataset_c:
  type: pandas.ParquetDataSet
  filepath: zzzz.pq

As you run the node/pipeline, Kedro will handle the load/save operations for you. You also don't need to save every dataset if it's only used mid-way through a pipeline; you can read about MemoryDataSets in the Kedro documentation.
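
For instance (a sketch; the upstream node make_predictions and the dataset name intermediate are hypothetical), any name that never appears in catalog.yml is automatically backed by an in-memory MemoryDataSet:

from kedro.pipeline import Pipeline, node


def make_predictions(data):
    ...  # hypothetical modelling node


def report_accuracy(predictions, test_y):
    ...  # the node from above


# "intermediate" is not declared in catalog.yml, so Kedro hands it from
# one node to the next in memory and discards it when the run finishes;
# "dataset_a", "dataset_b" and "dataset_c" are persisted via the catalog.
pipeline = Pipeline(
    [
        node(make_predictions, inputs="dataset_a", outputs="intermediate"),
        node(report_accuracy, inputs=["intermediate", "dataset_b"], outputs="dataset_c"),
    ]
)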

Upvotes: 7
