misterbee

Reputation: 5182

Is there a way to load a Gzipped file from Amazon S3 into Pentaho (PDI / Spoon / Kettle)?

Is there a way to load a Gzipped file from Amazon S3 into Pentaho Data Integration (Spoon)?

There is a "Text File Input" that has a Compression attribute that supports Gzip, but this module can't connect to S3 as a source.

There is an "S3 CSV Input" module, but it has no Compression attribute, so it can't decompress the Gzipped content into tabular form.

Also, there is no way to save the data from S3 to a local file. The downloaded content can only be "hopped" to another Step, but no Step can read gzipped data from a previous Step; the Gzip-compatible steps all read only from files.

So, I can get gzipped data from S3, but I can't send that data anywhere that can consume it.

Am I missing something? Is there a way to unzip zipped data from a non-file source?

Upvotes: 3

Views: 3804

Answers (3)

Milos

Reputation: 192

Kettle uses VFS (Virtual File System) when working with files, so you can fetch a file over http, ssh, ftp, zip, and so on, and use it like a regular local file in any step that reads files. Just use the right URL. You will find more here and here, and a very nice tutorial here. Also, check out the VFS transformation examples that come with Kettle.

This is the URL template for S3: s3://<Access Key>:<Secret Access Key>@s3<file path>

In your case, you would use "Text file input" with the compression setting you mentioned, and the selected file would be:

s3://aCcEsSkEy:SecrEttAccceESSKeeey@s3/your-s3-bucket/your_file.gzip
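
If you generate these URLs from a script, here is a minimal Python sketch that fills in the template above. The function name `build_s3_vfs_url` is made up for illustration. Percent-encoding the credentials is an assumption on my part: secret keys can contain characters such as `/` or `+` that would otherwise break the URL, but I have not verified how every PDI version decodes them.

```python
from urllib.parse import quote


def build_s3_vfs_url(access_key: str, secret_key: str, bucket: str, key: str) -> str:
    """Fill in the s3://<Access Key>:<Secret Access Key>@s3<file path> template.

    Hypothetical helper, not part of Kettle.
    """
    # safe="" escapes every reserved character, including '/' and '+'.
    # Assumption: the S3 VFS provider decodes percent-encoded credentials.
    credentials = f"{quote(access_key, safe='')}:{quote(secret_key, safe='')}"
    # The file path is "/<bucket>/<object key>".
    file_path = "/" + bucket.strip("/") + "/" + key.lstrip("/")
    return f"s3://{credentials}@s3{file_path}"


if __name__ == "__main__":
    print(build_s3_vfs_url("aCcEsSkEy", "SecrEttAccceESSKeeey",
                           "your-s3-bucket", "your_file.gzip"))
```

The resulting string goes into the file name field of "Text file input", with Compression set to GZip.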

Upvotes: 2

Sagar

Reputation: 51

You can also try the GZIP input step; it is available in Pentaho Kettle.

Upvotes: 1

rsilva4

Reputation: 1955

I don't know exactly how, but if you really need this, you can look into using S3 through the VFS capabilities that Pentaho Data Integration provides. I can see a vfs-providers.xml with the following content in my PDI CE distribution:

../data-integration/libext/pentaho/pentaho-s3-vfs-1.0.1.jar

<providers>
  <provider class-name="org.pentaho.s3.vfs.S3FileProvider">
    <scheme name="s3"/>
    <if-available class-name="org.jets3t.service.S3Service"/>
  </provider>
</providers>

Upvotes: 1
