Reputation: 503
I was looking at the iris project example provided by kedro. Apart from logging the accuracy, I also wanted to save the predictions and test_y as a CSV.
This is the example node provided by kedro.
import logging

import numpy as np
import pandas as pd


def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> None:
    """Node for reporting the accuracy of the predictions performed by the
    previous node. Notice that this function has no outputs, except logging.
    """
    # Get true class index
    target = np.argmax(test_y.to_numpy(), axis=1)
    # Calculate accuracy of predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)
I added the following at the end of the function to save the data.
    data = pd.DataFrame({"target": target, "prediction": predictions})
    data_set = CSVDataSet(filepath="data/test.csv")
    data_set.save(data)
This works as intended. However, my question is: is this the Kedro way of doing things? Can I declare the data_set in catalog.yml and later save data to it? If so, how do I access the data_set from catalog.yml inside a node?
Is there a way to save data without creating a dataset inside a node like this: data_set = CSVDataSet(filepath="data/test.csv")? I would like this to live in catalog.yml, if that is possible and follows Kedro convention.
Upvotes: 0
Views: 1267
Reputation: 1516
Kedro actually abstracts this part for you. You don't need to access the datasets via their Python API.
Your report_accuracy function does need to be tweaked to return the DataFrame instead of None.
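A minimal sketch of that tweak, assuming the body of the question's function is otherwise kept as is. The column names "target" and "prediction" are taken from the question; nothing here is specific to Kedro.

```python
import logging

import numpy as np
import pandas as pd


def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> pd.DataFrame:
    """Log the accuracy and return targets and predictions side by side."""
    # Get true class index from the one-hot encoded labels
    target = np.argmax(test_y.to_numpy(), axis=1)
    # Calculate accuracy of predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)
    # Return the frame instead of saving it; no CSVDataSet is created here
    return pd.DataFrame({"target": target, "prediction": predictions})
```

The function no longer knows where the data ends up, so it can be tested with plain arrays and DataFrames.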
Your node needs to be defined as such:
node(
    func=report_accuracy,
    inputs='dataset_a',
    outputs='dataset_b'
)
Kedro then looks at your catalog and loads/saves dataset_a and dataset_b as required:
dataset_a:
  type: pandas.CSVDataSet
  filepath: xxxx.csv
dataset_b:
  type: pandas.ParquetDataSet
  filepath: yyyy.pq
As you run the node/pipeline, Kedro will handle the load/save operations for you. You also don't need to save every dataset if it's only used midway through a pipeline: any output that is not declared in the catalog is kept in memory as a MemoryDataSet, which is described in the Kedro documentation.
Upvotes: 7