AWS multipart upload from InputStream has bad offset

Question

I am using the Java Amazon AWS SDK to perform some multipart uploads from HDFS to S3. My code is the following:

        for (int i = startingPart; currentFilePosition < contentLength; i++)
        {
            FSDataInputStream inputStream = fs.open(new Path(hdfsFullPath));

            // Last part can be less than 5 MB. Adjust part size.
            partSize = Math.min(partSize, (contentLength - currentFilePosition));

            // Create request to upload a part.
            UploadPartRequest uploadRequest = new UploadPartRequest()
                    .withBucketName(bucket).withKey(s3Name)
                    .withUploadId(currentUploadId)
                    .withPartNumber(i)
                    .withFileOffset(currentFilePosition)
                    .withInputStream(inputStream)
                    .withPartSize(partSize);

            // Upload part and add response to our list.
            partETags.add(s3Client.uploadPart(uploadRequest).getPartETag());
            currentFilePosition += partSize;

            inputStream.close();

            lastFilePosition = currentFilePosition;
        }

However, the uploaded file is not the same as the original one. More specifically, I am testing with a file of about 20 MB, and the parts I upload are 5 MB each. At the end of each 5 MB part, I see some extra text, which is always 96 characters long.

Even stranger, if I pass a deliberately wrong value to .withFileOffset(), for example,

.withFileOffset(currentFilePosition-34)

the error stays the same. I was expecting to get different characters, but I get exactly the same 96 extra characters, as if I hadn't modified the line.

Any ideas what might be wrong?

Thanks, Serban

Serban Stoenescu · Accepted Answer

I figured it out. This came from a wrong assumption on my part: I expected .withFileOffset(...) to position the read in my source stream, and it does not. As far as I can tell from the SDK documentation, that offset applies to a file supplied with .withFile(...), and where a part ends up in the S3 object is determined by its part number. Because I open a new stream on every iteration, each part was read from the beginning of the source file and only its position in the destination changed. The solution is to seek to the current position right after opening the stream:

            FSDataInputStream inputStream = fs.open(new Path(hdfsFullPath));

            inputStream.seek(currentFilePosition);
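
To see the effect in isolation, here is a minimal sketch that uses only the JDK, with no HDFS or S3 involved. It assumes a ByteArrayInputStream as a stand-in for the FSDataInputStream and a skip loop as a stand-in for seek(); the class and method names (PartOffsetDemo, readPart, reassemble) are made up for illustration and are not part of any SDK. It reopens the source for every part, as the loop in the question does, and shows what comes out with and without positioning the stream first.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class PartOffsetDemo {

    // Reads one part from a freshly opened stream, as the upload loop does.
    // With seekFirst == false the stream stays at position 0 (the bug).
    static String readPart(byte[] source, long offset, int partSize, boolean seekFirst)
            throws IOException {
        try (InputStream in = new ByteArrayInputStream(source)) {
            if (seekFirst) {
                // Stand-in for FSDataInputStream.seek(offset).
                long remaining = offset;
                while (remaining > 0) {
                    long skipped = in.skip(remaining);
                    if (skipped <= 0) {
                        throw new IOException("Could not skip to offset " + offset);
                    }
                    remaining -= skipped;
                }
            }
            // Read up to partSize bytes from the current stream position.
            byte[] buffer = new byte[partSize];
            int filled = 0;
            while (filled < partSize) {
                int n = in.read(buffer, filled, partSize - filled);
                if (n < 0) {
                    break; // end of stream
                }
                filled += n;
            }
            return new String(buffer, 0, filled, StandardCharsets.US_ASCII);
        }
    }

    // Splits the data into parts, reopening the source for each part,
    // and concatenates what was actually read.
    static String reassemble(String data, int partSize, boolean seekFirst) throws IOException {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        byte[] source = data.getBytes(StandardCharsets.US_ASCII);
        StringBuilder out = new StringBuilder();
        for (long position = 0; position < source.length; position += partSize) {
            // The last part can be shorter than partSize.
            int size = (int) Math.min(partSize, source.length - position);
            out.append(readPart(source, position, size, seekFirst));
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(reassemble("AAAABBBBCC", 4, false)); // prints AAAAAAAAAA
        System.out.println(reassemble("AAAABBBBCC", 4, true));  // prints AAAABBBBCC
    }
}
```

Without the seek, every part repeats the start of the source, which matches what I saw in the uploaded object. An alternative, which I have not tried here, would be to open the stream once before the loop and let consecutive reads advance it.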
