Reputation: 71
I have Hadoop in my environment and use it as an S3-like object store.
My current task is to implement logic for uploading large files (say, > 1 GB) to Hadoop without buffering, so the data should be streamed into it.
I've found the org.apache.hadoop.fs.MultipartUploader
interface, whose description looks like exactly what I need. But there is no guide on how to use it in the official docs, only a textual description and the order of operations for uploading (start, putPart, complete).
The Javadoc has a similar description.
I've tried to use this interface as follows:
According to the docs, I've created a MultipartUploader instance (FileSystemMultipartUploader is the implementation I need):
MultipartUploader mu =
    fs.createMultipartUploader(new Path("base/path/for/uploading/to/hdfs"))
        .build();
Path targetPath = new Path("base/path/plus/result/filename");
Then the first step of the MultipartUploader interface:
/**
* Initialize a multipart upload.
* @param filePath Target path for upload.
* @return unique identifier associating part uploads.
* @throws IOException IO failure
*/
CompletableFuture<UploadHandle> startUpload(Path filePath)
throws IOException;
So we initialize the upload:
UploadHandle uh = mu.startUpload(targetPath).get();
The second step:
/**
* Put part as part of a multipart upload.
* It is possible to have parts uploaded in any order (or in parallel).
* @param uploadId Identifier from {@link #startUpload(Path)}.
* @param partNumber Index of the part relative to others.
* @param filePath Target path for upload (as {@link #startUpload(Path)}).
* @param inputStream Data for this part. Implementations MUST close this
* stream after reading in the data.
* @param lengthInBytes Target length to read from the stream.
* @return unique PartHandle identifier for the uploaded part.
* @throws IOException IO failure
*/
CompletableFuture<PartHandle> putPart(
UploadHandle uploadId,
int partNumber,
Path filePath,
InputStream inputStream,
long lengthInBytes)
throws IOException;
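For completeness, the third operation mentioned in the docs, complete, appears to have this shape in the same interface (if I'm reading it right, it takes the upload handle, the target path, and a non-empty map from part number to PartHandle, and returns a PathHandle):
CompletableFuture<PathHandle> complete(
    UploadHandle uploadId,
    Path filePath,
    Map<Integer, PartHandle> handles)
    throws IOException;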
And here the difficulties appear: OK, we have the uploadId
from the first step and the target filePath
, but we have no idea what partNumber
, inputStream
and lengthInBytes
are!
I think partNumber
is supposed to be the ordinal number of a chunk of the uploaded file, and inputStream
and lengthInBytes
are the input stream and the length in bytes of that chunk, respectively.
But I don't understand where these three parameters should be taken from. I thought MultipartUploader would "slice" the file itself, under the hood.
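To make the question concrete, here is a sketch of what I imagine the calling code would have to look like if the slicing is up to the caller. This is my guess, not working code: the chunk size, the part numbering starting from 1, and the local source file path are all my assumptions, and the complete() call is the signature quoted above.
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.MultipartUploader;
import org.apache.hadoop.fs.PartHandle;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathHandle;
import org.apache.hadoop.fs.UploadHandle;

public class MultipartUploadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        MultipartUploader mu =
            fs.createMultipartUploader(new Path("base/path/for/uploading/to/hdfs"))
                .build();
        Path targetPath = new Path("base/path/plus/result/filename");

        UploadHandle uploadId = mu.startUpload(targetPath).get();

        // complete() wants a map from part number to PartHandle, so collect them.
        Map<Integer, PartHandle> parts = new HashMap<>();

        // 64 MB per part -- the chunk size is my guess, not a documented value.
        byte[] buffer = new byte[64 * 1024 * 1024];

        // "/local/path/to/big/file" is a made-up source; the caller slices it.
        try (InputStream source = new FileInputStream("/local/path/to/big/file")) {
            int partNumber = 1; // assuming parts are numbered from 1, as in S3
            int read;
            // readNBytes (Java 9+) fills the buffer fully except the last chunk.
            while ((read = source.readNBytes(buffer, 0, buffer.length)) > 0) {
                // Note: this keeps one whole chunk in memory per call, which is
                // exactly the kind of buffering I was hoping to avoid.
                PartHandle ph = mu.putPart(uploadId, partNumber, targetPath,
                    new ByteArrayInputStream(buffer, 0, read), read).get();
                parts.put(partNumber, ph);
                partNumber++;
            }
        }

        // Third step from the docs: complete the upload with all part handles.
        PathHandle result = mu.complete(uploadId, targetPath, parts).get();
        System.out.println("Completed upload: " + result);
    }
}
Is something like this the intended usage, or is there a way to hand the uploader a single stream and let it do the slicing?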
There is no information about this in Hadoop's official docs, and I also didn't manage to find any example of MultipartUploader usage, either on Stack Overflow or anywhere else.
So please, can anyone explain how to use it, or share an example if you have one?
Upvotes: 0
Views: 33