Reputation: 12545
I want to download large files from Google Cloud Storage using the Google-provided Java library com.google.cloud.storage. I have working code, but I still have one question and one major concern:
My major concern is: when is the file content actually downloaded? During storage.get(blobId), during blob.reader(), or during reader.read(bytes) (all of these refer to the code below)? This becomes very important when handling an invalid checksum: what do I need to do to make sure the file is actually fetched over the network again?
The simpler question is: is there built-in functionality in the Google library to do an MD5 (or CRC32C) check on the received file? Maybe I don't need to implement it on my own.
Here is my method trying to download big files from Google Cloud Storage:
private static final int MAX_NUMBER_OF_TRIES = 3;

public Path downloadFile(String storageFileName, String bucketName) throws IOException {
    // In my real code, this is a field populated in the constructor.
    Storage storage = Objects.requireNonNull(StorageOptions.getDefaultInstance().getService());
    BlobId blobId = BlobId.of(bucketName, storageFileName);
    Path outputFile = Paths.get(storageFileName.replaceAll("/", "-"));
    int retryCounter = 1;
    Blob blob;
    boolean checksumOk;
    MessageDigest messageDigest;
    try {
        messageDigest = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException ex) {
        throw new RuntimeException(ex);
    }
    do {
        LOGGER.debug("Start download file {} from bucket {} to Content Store (try {})", storageFileName, bucketName, retryCounter);
        blob = storage.get(blobId);
        if (null == blob) {
            throw new CloudStorageCommunicationException("Failed to download file after " + retryCounter + " tries.");
        }
        if (Files.exists(outputFile)) {
            Files.delete(outputFile);
        }
        try (ReadChannel reader = blob.reader();
             FileChannel channel = new FileOutputStream(outputFile.toFile(), true).getChannel()) {
            ByteBuffer bytes = ByteBuffer.allocate(128 * 1024);
            int bytesRead = reader.read(bytes);
            while (bytesRead > 0) {
                bytes.flip();
                messageDigest.update(bytes.array(), 0, bytesRead);
                channel.write(bytes);
                bytes.clear();
                bytesRead = reader.read(bytes);
            }
        }
        String checksum = Base64.encodeBase64String(messageDigest.digest());
        checksumOk = checksum.equals(blob.getMd5());
        if (!checksumOk) {
            Files.delete(outputFile);
            messageDigest.reset();
        }
    } while (++retryCounter <= MAX_NUMBER_OF_TRIES && !checksumOk);
    if (!checksumOk) {
        throw new CloudStorageCommunicationException("Failed to download file after " + MAX_NUMBER_OF_TRIES + " tries.");
    }
    return outputFile;
}
Upvotes: 3
Views: 3677
Reputation: 38714
As the JavaDoc of ReadChannel says:
Implementations of this class may buffer data internally to reduce remote calls.
So the implementation you get from blob.reader() could cache the whole file, some bytes, or nothing at all and just fetch byte by byte when you call read(). You will never know, and you shouldn't care.
Since only read() throws an IOException and the other methods you used do not, I'd say that only calling read() actually downloads anything. You can also see this in the library's source code.
By the way, despite the example in the library's JavaDoc, you should check for >= 0, not > 0. A return value of 0 just means that nothing was read, not that the end of the stream was reached. End of stream is signaled by returning -1.
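To illustrate the loop condition, here is a minimal sketch of a copy loop that stops only on -1. It is written against the plain java.nio ReadableByteChannel interface (which ReadChannel extends), so it runs without the Google library; the class and method names (ChannelCopy, copy) are made up for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

final class ChannelCopy {

    /** Copies everything from source to sink and returns the number of bytes written. */
    static long copy(ReadableByteChannel source, WritableByteChannel sink) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(128 * 1024);
        long total = 0;
        // read() returns -1 at end of stream; 0 only means "nothing was read this time".
        while (source.read(buffer) >= 0) {
            buffer.flip();
            // A single write() is not guaranteed to drain the buffer, so loop until it is empty.
            while (buffer.hasRemaining()) {
                total += sink.write(buffer);
            }
            buffer.clear();
        }
        return total;
    }
}
```

In the question's code, this would presumably be called as copy(reader, channel) inside the try-with-resources block.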
To retry after a failed checksum check, get a new reader from the blob. If anything caches the downloaded data, it is the reader itself, so a new reader obtained from the blob will download the file from the remote again.
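A rough sketch of that retry idea, assuming each attempt obtains a fresh channel from a supplier (with the Google library this would presumably be something like () -> blob.reader()). The names VerifiedDownload, md5Base64 and attemptsUntilVerified are hypothetical, and writing the bytes to a file is left out to keep the example short.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.function.Supplier;

final class VerifiedDownload {

    /** Reads the channel to the end and returns the Base64-encoded MD5 of its content. */
    static String md5Base64(ReadableByteChannel source) throws IOException {
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        ByteBuffer buffer = ByteBuffer.allocate(128 * 1024);
        while (source.read(buffer) >= 0) {
            buffer.flip();
            md5.update(buffer); // consumes the bytes between position and limit
            buffer.clear();
        }
        return Base64.getEncoder().encodeToString(md5.digest());
    }

    /**
     * Opens a new channel for every attempt, so nothing buffered by a previous
     * reader is reused. Returns the attempt number that produced a matching checksum.
     */
    static int attemptsUntilVerified(Supplier<? extends ReadableByteChannel> newReader,
                                     String expectedMd5Base64,
                                     int maxTries) throws IOException {
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try (ReadableByteChannel reader = newReader.get()) {
                if (md5Base64(reader).equals(expectedMd5Base64)) {
                    return attempt;
                }
            }
        }
        throw new IOException("Checksum mismatch after " + maxTries + " tries");
    }
}
```

The expected value would be the one the question already compares against, blob.getMd5().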
Upvotes: 0
Reputation: 38389
The google-cloud-java storage library does not validate checksums on its own when reading data, beyond the normal HTTPS/TCP correctness checking. If it compared the MD5 of the received data to the known MD5, it would need to download the entire file before it could return any results from read(), which would be infeasible for very large files.
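Since the question also mentions CRC32C, here is a small sketch of computing it incrementally with java.util.zip.CRC32C (Java 9+). It assumes that the service reports the CRC32C as the Base64 encoding of the 4-byte big-endian value, which is what blob.getCrc32c() appears to return; please check that against the documentation. The class and method names are made up for the example.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.zip.CRC32C;

final class Crc32cBase64 {

    /** Encodes the current checksum value as Base64 of its 4 big-endian bytes. */
    static String encode(CRC32C crc) {
        // ByteBuffer is big-endian by default; getValue() holds the CRC in the low 32 bits.
        byte[] bigEndian = ByteBuffer.allocate(4).putInt((int) crc.getValue()).array();
        return Base64.getEncoder().encodeToString(bigEndian);
    }

    /** Convenience helper for data that is already fully in memory. */
    static String of(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        return encode(crc);
    }
}
```

For a streamed download, the CRC32C instance would be updated with each chunk inside the read loop, like the MessageDigest in the question, and encode(crc) compared with the expected value once the stream ends.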
What you're doing is a good idea if you need the additional protection of comparing MD5s. If this is a one-off task, you could use the gsutil command-line tool, which does this same sort of additional check.
Upvotes: 2