DevEx

Reputation: 4571

How to use pentaho kettle to load multiple files from s3 bucket

I want to use the S3 CSV Input step to load multiple files from an S3 bucket, then transform them and load them back into S3. But as far as I can see, this step supports only one file at a time and requires me to supply the file name. Is there any way to load all the files at once by supplying only the bucket name, i.e. <s3-bucket-name>/*?

Upvotes: 1

Views: 5247

Answers (2)

Neil McGuigan

Reputation: 48287

Two options:

AWS CLI method

  1. Write a simple shell script that calls the AWS CLI. Put it in your path and call it s3.sh (a fuller sketch of the script follows this list):

       aws s3 ls s3://bucket.name/path | cut -c32-

In PDI:

  2. Generate Rows: Limit 1, Fields: Name: process, Type: String, Value: s3.sh

  3. Execute a Process: Process field: process, Output Line Delimiter: |

  4. Split Field to Rows: Field to split: Result output, Delimiter: |, New field name: filename

  5. S3 CSV Input: The filename field: filename
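For reference, the s3.sh wrapper from step 1 might look like the sketch below. The bucket path is a placeholder, and the -c32- offset assumes the default aws s3 ls column layout, so adjust it if your listing looks different:

    #!/bin/sh
    # s3.sh -- print one object name per line for a bucket path (sketch).
    # `aws s3 ls` prints "DATE TIME SIZE NAME"; cut -c32- keeps the NAME column.
    aws s3 ls s3://bucket.name/path | cut -c32-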

S3 Local Sync

Mount the S3 bucket path to a local directory using s3fs:

    $ s3fs my-bucket.example.com:/path ~/my-s3-files -o use_path_request_style -o url=https://s3.us-west-2.amazonaws.com

Then use the standard file reading tools.

If you have many large files in that bucket directory, it won't be very fast... though it might be fine if your PDI instance runs on an Amazon machine.
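Once the mount succeeds, the bucket contents behave like ordinary local files (the mount point below is just the example from the command above), so a quick check and a wildcard path in a regular PDI file-input step are all that's left:

    $ ls ~/my-s3-files
    # then point a standard file-input step (e.g. Text File Input) at ~/my-s3-files/*.csv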

Upvotes: 1

marabu

Reputation: 1196

S3 CSV Input is inspired by CSV Input and doesn't support multi-file processing the way Text File Input does, for example. You'll have to retrieve the filenames first, so you can loop over the filename list just as you would with CSV Input.
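As a sketch of the "retrieve the filenames first" part (the bucket name is a placeholder, and this assumes object names without spaces), the list could come from the AWS CLI, much like the s3.sh approach in the other answer, and then be looped over in PDI:

    # print the object-name column of `aws s3 ls` output, one filename per row to loop over
    aws s3 ls s3://my-bucket/path/ | awk '{print $4}'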

Upvotes: 3
