Reputation: 594
What's the recommended approach to integrating zipped files into Foundry? I can see 3 options:

1. Unzip the files with some external tool before they are ingested.
2. Unzip the files on the Data Connection agent as part of the ingestion.
3. Ingest the archives as-is and unzip them in a downstream transform.
Upvotes: 1
Views: 984
Reputation: 967
Generally I would recommend against 1 and 2. I often even do the opposite of both: I compress files before ingesting them and never keep them in uncompressed form anywhere in a Foundry dataset.
If the files are merely compressed with gzip or bzip2, but are not tarballs, then Foundry lets you access them transparently, as if they were not compressed at all. For instance, I once uploaded a single file, test1.csv.bz2, into a dataset and could work with it like a plain CSV.
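To illustrate, here is a minimal sketch of a Python transform reading such files; the dataset paths and the `*.csv.bz2` glob are hypothetical, not from the original post. Spark decompresses gzip and bzip2 files on the fly when reading, so no extraction step is needed:

```python
from transforms.api import transform, Input, Output


@transform(
    raw=Input("/examples/compressed_csvs"),  # hypothetical input dataset
    out=Output("/examples/parsed"),          # hypothetical output dataset
)
def parse_compressed(ctx, raw, out):
    fs = raw.filesystem()
    # Collect the full Hadoop paths of the compressed CSVs; Spark will
    # decompress .gz/.bz2 transparently while reading them.
    paths = [fs.hadoop_path + "/" + f.path for f in fs.ls(glob="*.csv.bz2")]
    df = ctx.spark_session.read.csv(paths, header=True)
    out.write_dataframe(df)
```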
However, this breaks for tarballs and other archive formats in which multiple files are bundled into a single archive. So if you can arrange for your files to be compressed this way, that's the easiest and likely best approach.
Otherwise I would recommend approach 3: extract the archives in memory, then write whatever results you've computed out to the downstream dataset as parquet.
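A minimal sketch of that approach, assuming gzipped tarballs of CSV files in a hypothetical input dataset (the paths, glob, and file layout are all assumptions for illustration):

```python
import csv
import io
import tarfile

from pyspark.sql import Row
from transforms.api import transform, Input, Output


@transform(
    archives=Input("/examples/raw_tarballs"),  # hypothetical input dataset
    out=Output("/examples/extracted"),         # hypothetical output dataset
)
def extract_archives(ctx, archives, out):
    fs = archives.filesystem()
    rows = []
    for status in fs.ls(glob="*.tar.gz"):
        # Open the archive straight from the dataset and extract in memory;
        # stream mode ("r|gz") avoids seeking, and nothing unzipped is ever
        # written to disk or to a dataset.
        with fs.open(status.path, "rb") as f:
            with tarfile.open(fileobj=f, mode="r|gz") as tar:
                for member in tar:
                    if member.isfile() and member.name.endswith(".csv"):
                        text = tar.extractfile(member).read().decode("utf-8")
                        rows.extend(csv.DictReader(io.StringIO(text)))
    # Build a DataFrame from the parsed rows (columns come from the CSV
    # headers); Foundry stores the written output as parquet.
    df = ctx.spark_session.createDataFrame([Row(**r) for r in rows])
    out.write_dataframe(df)
```

For large archives you would distribute the extraction (for example, with a flatMap over the list of files) rather than loop on the driver, but the shape is the same.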
Upvotes: 2
Reputation: 16866
In general, Option 3 (unzipping in transforms) is the best choice.
Option 1 introduces a dependency on some unmanaged external tool to do the unzipping. If you have access to the box, it's not out of the question that you could maintain something like this, but it's certainly not ideal.
Option 2 is supported by existing plugins (zip and tgz), and at first glance it seems like a pretty good option. The trouble is that doing this work on your agent increases the load on it. Agents generally run on small on-prem boxes without much memory or compute power, and if you overtax one, it impacts everything running on that agent: knock it over and nothing runs; stress it short of that and everything still runs, just more slowly than if the agent weren't doing that extra work.
Option 3 will require Java or Python transforms (something with raw file access), and will be a little more complex than Option 2, but it should be more robust. You've (generally) got a lot more compute power to throw at the problem.
The general wisdom about Data Connection is that agents should do as little work as possible. They should just be transferring data into the platform, where it can be cleaned and transformed.
Upvotes: 1