Reputation: 1458
I have a pretty big (~200 GB, ~20M lines) raw JSONL dataset. I need to extract important properties from it and store the intermediate dataset as CSV for further conversion into something like HDF5, Parquet, etc. Obviously, I can't use JSONDataSet to load the raw dataset, because it uses pandas.read_json under the hood, and using pandas for a dataset of this size sounds like a bad idea. So I'm thinking about reading the raw dataset line by line, processing it, and appending the processed data line by line to the intermediate dataset.
What I can't understand is how to make this compatible with AbstractDataSet and its _load and _save methods.
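Roughly, what I have in mind is something like the sketch below, where _load returns a generator and _save consumes an iterable, so the dataset is never fully materialised in memory. JSONLinesDataSet is just a name I made up, and I'm not sure returning a generator from _load is the idiomatic way to do this:

import json
from pathlib import Path
from typing import Any, Dict, Iterable, Iterator

from kedro.io import AbstractDataSet


class JSONLinesDataSet(AbstractDataSet):
    """Streams a JSONL file line by line instead of loading it into memory."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=str(self._filepath))

    def _load(self) -> Iterator[Dict[str, Any]]:
        # Return a generator so downstream nodes can iterate lazily.
        def records():
            with self._filepath.open("r", encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        yield json.loads(line)
        return records()

    def _save(self, data: Iterable[Dict[str, Any]]) -> None:
        # Write records one at a time, never holding the whole dataset in memory.
        with self._filepath.open("w", encoding="utf-8") as f:
            for record in data:
                f.write(json.dumps(record) + "\n")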
P.S. I understand I can move this out of Kedro's context and introduce the preprocessed dataset as a raw one, but that kinda breaks the whole idea of complete pipelines.
Upvotes: 6
Views: 1124
Reputation: 558
Try using PySpark to leverage lazy evaluation and batch execution. SparkDataSet is implemented in kedro.contrib.io.pyspark.spark_data_set
Sample catalog config for jsonl:
your_dataset_name:
  type: kedro.contrib.io.pyspark.SparkDataSet
  filepath: "/file_path"
  file_format: json
  load_args:
    multiline: True
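In the node you then get a lazily evaluated Spark DataFrame, so you can select the properties you need and let the catalog write the result out without collecting anything on the driver. A rough sketch, where the column names are just placeholders for whatever properties you want to extract:

from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def extract_properties(raw: DataFrame) -> DataFrame:
    # Placeholder columns; replace with the properties you actually need.
    return raw.select(
        F.col("id"),
        F.col("payload.some_property").alias("some_property"),
    )

Declare the node's output as another SparkDataSet with file_format: csv or parquet and Spark will write it out in a distributed fashion. Also note that Spark's JSON reader treats each line as a separate record by default, so for strictly line-delimited JSONL you may not need multiline: True at all.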
Upvotes: 4