Anton Kirilenko

Reputation: 169

Are S3 Kedro datasets thread-safe?

CSVS3DataSet and HDFS3DataSet use boto3, which is known not to be thread-safe: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html?highlight=multithreading#multithreading-multiprocessing

Is it OK to use these datasets with the ParallelRunner?

Upvotes: 2

Views: 221

Answers (1)

Anton Kirilenko

Reputation: 169

Kedro uses s3fs, which in turn uses the boto3 library to access S3. Boto3 is indeed not thread-safe, but only if you reuse the same Session object across threads.

Every Kedro S3 dataset maintains its own instance of S3FileSystem, which means a separate boto session per dataset, so it's safe.
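To illustrate the idea, here is a minimal sketch of the "one session per dataset" pattern. It does not use Kedro, s3fs, or boto3: `StubSession` and `StubS3DataSet` are hypothetical stand-ins I made up for the example, so treat it as a picture of the ownership model rather than Kedro's actual code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StubSession:
    """Hypothetical stand-in for a boto3 Session (not a real boto3 class)."""

    def __init__(self):
        # Guards this session's own state; nothing is shared between sessions.
        self.lock = threading.Lock()
        self.calls = 0


class StubS3DataSet:
    """Hypothetical dataset that owns its session instead of sharing one."""

    def __init__(self, path):
        self.path = path
        # Assumption for the sketch: one session is created per dataset instance.
        self._session = StubSession()

    def load(self):
        with self._session.lock:
            self._session.calls += 1
        # Return the session identity so we can check that nothing is shared.
        return self.path, id(self._session)


datasets = [StubS3DataSet(f"bucket/file_{i}.csv") for i in range(4)]

# Load every dataset from a different worker thread.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda ds: ds.load(), datasets))

# One distinct session per dataset: no session object is reused across datasets.
session_ids = {session_id for _, session_id in results}
```

Because each dataset only ever touches its own session, two datasets loaded in parallel never call into the same session object.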

This is probably not great for performance: if you work with hundreds of S3 datasets in parallel, or thousands of small S3 datasets sequentially, the pipeline might run for a long time and even fail with connection errors. With a few dozen of them, though, you are totally safe.

Upvotes: 2
