Reputation: 169
CSVS3DataSet/HDFS3DataSet use boto3, which is known not to be thread-safe: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html?highlight=multithreading#multithreading-multiprocessing
Is it OK to use these datasets with the ParallelRunner?
Upvotes: 2
Views: 221
Reputation: 169
Kedro uses s3fs, which uses the boto3 library to access S3. Boto3 is indeed not thread-safe, but only if you try to reuse the same Session object.
All Kedro S3 datasets maintain separate instances of S3FileSystem, which means separate boto sessions, so it's safe.
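To illustrate the difference between sharing one session and giving each worker its own, here is a minimal sketch using only the standard library. It is not Kedro's or s3fs's actual code: StubSession is a hypothetical stand-in for boto3.session.Session, and PerThreadSessions and run_workers are names made up for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StubSession:
    """Hypothetical stand-in for boto3.session.Session (keeps the sketch runnable without boto3)."""


class PerThreadSessions:
    """Hands each thread its own session instead of sharing a single object."""

    def __init__(self, session_factory=StubSession):
        # With boto3 installed you could pass boto3.session.Session here instead.
        self._factory = session_factory
        self._local = threading.local()

    def get(self):
        # Create the session lazily, once per thread, then reuse it in that thread.
        if not hasattr(self._local, "session"):
            self._local.session = self._factory()
        return self._local.session


def run_workers(n_workers=4):
    """Run n_workers threads concurrently and return the session each one used."""
    sessions = PerThreadSessions()
    barrier = threading.Barrier(n_workers)

    def work(_):
        barrier.wait()  # hold every worker until all of them have started
        first = sessions.get()
        second = sessions.get()
        assert first is second  # same thread, same session
        return first

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(work, range(n_workers)))


if __name__ == "__main__":
    used = run_workers(4)
    print(len({id(s) for s in used}), "distinct sessions for", len(used), "workers")
```

No session object is ever touched by more than one thread, which is the property the answer relies on when it says separate S3FileSystem instances are safe.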
It's probably not great in terms of performance: if you work with hundreds of S3 datasets in parallel, or thousands of small S3 datasets sequentially, the pipeline might run for a long time and even fail on connection errors. With a few dozen of them, though, you are totally safe.
Upvotes: 2