pkal

Reputation: 45

How to tar files with a size limit and write to a remote location?

I need to move a large number of files to S3 with the timestamps intact (ctime, mtime etc. need to be preserved, which rules out the aws s3 sync command). For this I use the following command:

sudo tar -c --use-compress-program=pigz -f - <folder>/ |  aws s3 cp - s3://<bucket>/<path-to-folder>/

When trying to create a tar.gz file using the above command, for a folder that is 80+GB, I ran into the following error:

upload failed: - to s3://<bucket>/<path-to-folder>/<filename>.tar.gz An error occurred (InvalidArgument) when calling the UploadPart operation: Part number must be an integer between 1 and 10000, inclusive

Upon researching this, I found that there is a limit of roughly 68GB for tar files (the file-size field in the tar header is 12 octal digits, which tops out at 8^12 - 1 bytes, i.e. about 68.7GB).

Upon further research, I also found a solution (here) that shows how to create a set of tar.gz files using split:

tar cvzf - data/ | split --bytes=100GB - sda1.backup.tar.gz.

which can later be untarred with:

cat sda1.backup.tar.gz.* | tar xzvf -

However, split has a different signature: split [OPTION]... [FILE [PREFIX]]

...So the obvious solution:

sudo tar -c --use-compress-program=pigz -f - folder/ | split --bytes=20GB - prefix.tar.gz. | aws s3 cp - s3://<bucket>/<path-to-folder>/

...will not work, since split treats the prefix as a literal string and writes its output to local files named from that prefix, so nothing is left on stdout for aws s3 cp to read.

The question is: is there a way to code this such that I can effectively use a piped solution (i.e., not use additional disk space) and still get a set of files (named prefix.tar.gz.aa, prefix.tar.gz.ab etc.) in S3?
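One possible direction I have not tried yet: GNU split has a --filter option that pipes each chunk to a command (with the chunk's name available as $FILE) instead of writing it to a local file. If I read the man page correctly, something along these lines might work:

sudo tar -c --use-compress-program=pigz -f - folder/ \
  | split --bytes=20GB \
          --filter='aws s3 cp - "s3://<bucket>/<path-to-folder>/$FILE"' \
          - prefix.tar.gz.

In theory each chunk would then land in S3 as prefix.tar.gz.aa, prefix.tar.gz.ab and so on, and the cat ... | tar xzvf - restore above should still apply, but I have not verified this.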

Any pointers would be helpful.

--PK

Upvotes: 1

Views: 1273

Answers (1)

Erwin

Reputation: 952

That looks like a non-trivial challenge. Pseudo-code might look like this:

# Start with an empty list
list = ()
counter = 1
foreach file in folder/ do
  if adding file to list exceeds tar or s3 limits then
    # Flush current list of files to S3
    write list to tmpfile
    run tar czf - --files-from=tmpfile | aws s3 cp - s3://<bucket>/<path-to-file>.<counter>
    list = ()
    counter = counter + 1
  end if
  add file to list
end foreach
if list non-empty
  write list to tmpfile
  run tar czf - --files-from=tmpfile | aws s3 cp - s3://<bucket>/<path-to-file>.<counter>
end if

This uses the --files-from option of tar to avoid having to pass individual files as command-line arguments and running into argument-length limits there.
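As a rough bash sketch of the same idea (the 20GB threshold, archive name, bucket and path below are placeholders, the size accounting ignores tar/compression overhead, and filenames containing newlines are not handled):

#!/bin/bash
# Sketch: batch files into groups of at most ~20GB, stream each batch to S3 as its own tar.gz.
DEST="s3://<bucket>/<path-to-folder>"        # placeholder destination
LIMIT=$((20 * 1024 * 1024 * 1024))           # per-archive size budget in bytes
tmpfile=$(mktemp)
counter=1
batch_size=0

flush() {
  [ -s "$tmpfile" ] || return 0              # nothing queued yet
  tar -c --use-compress-program=pigz --files-from="$tmpfile" -f - \
    | aws s3 cp - "$DEST/backup.tar.gz.$counter"
  : > "$tmpfile"                             # reset the batch
  batch_size=0
  counter=$((counter + 1))
}

# Walk the folder; flush the current batch before a file would push it past the limit.
# Note: a single file larger than LIMIT still ends up in an over-sized archive of its own.
while IFS= read -r file; do
  size=$(stat -c %s "$file")                 # GNU stat
  if [ $((batch_size + size)) -gt "$LIMIT" ]; then
    flush
  fi
  printf '%s\n' "$file" >> "$tmpfile"
  batch_size=$((batch_size + size))
done < <(find folder/ -type f)

flush                                        # upload whatever is left over
rm -f "$tmpfile"

Unlike the split approach in the question, each uploaded part here is a complete, independently extractable tar.gz, so restoring is a matter of downloading and untarring each part on its own.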

Upvotes: 1
