aldel

Reputation: 6748

Images in Blobstore: inefficient to get metadata?

Summary: I'm using Blobstore to let users upload images to be served. I want to prevent users from uploading files that aren't valid images or whose dimensions are too large. I'm using App Engine's Images service to get the relevant metadata. But in order to get any information about the image type or dimensions from the Images service, you first have to execute a transform, which fetches the transformed image to the App Engine server. I have it do a no-op crop and encode as a very low-quality JPEG, but it's still fetching an actual image when all I want is the dimensions and file type. Is this the best I can do? Will the internal transfer of the image data (from Blobstore to the App Engine server) cost me?

Details:

It seems like Blobstore was carefully designed for efficient serving of images from App Engine. On the other hand, certain operations seem to make you jump through inefficient hoops. I'm hoping someone can tell me that there's a more efficient way, or convince me that what I'm doing is not as wasteful as I think it is.

I'm letting users upload images to be served as part of other user-generated content. Blobstore makes the uploading and serving pretty easy. Unfortunately it lets the user upload any file they want, and I want to impose restrictions.

(Side note: Blobstore does let you limit the file size of uploads, but this feature is poorly documented. It turns out that if the user tries to exceed the limit, Blobstore returns a 413 "Request Entity Too Large" response, and the App Engine handler is not called at all.)
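For reference, that limit is set when generating the upload URL; a minimal sketch (the /upload path and the 1 MB cap are placeholder values, not from the question):

from google.appengine.ext import blobstore

# Uploads through this URL are capped at 1 MB per blob; Blobstore
# rejects larger files with a 413 before the handler ever runs.
upload_url = blobstore.create_upload_url(
    '/upload',                       # hypothetical handler path
    max_bytes_per_blob=1024 * 1024)  # 1 MB per uploaded file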

I want to allow only valid JPEG, GIF, and PNG files, and I want to limit the dimensions. The way to do this seems to be to check the file after upload, and delete it if it's not allowed. Here's what I've got:

import logging

from google.appengine.api import images
from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers


class ImageUploadHandler(blobstore_handlers.BlobstoreUploadHandler):
  def post(self):
    # Fetch uploads before the try block so the except clause can
    # always clean them up.
    uploads = self.get_uploads()
    try:
      # TODO: Check that user is logged in and has quota; xsrfToken.
      if len(uploads) != 1:
        logging.error('{} files uploaded'.format(len(uploads)))
        raise ServerError('Must be exactly 1 image per upload')
      image = images.Image(blob_key=uploads[0].key())
      # Do a no-op transformation; otherwise execute_transforms()
      # doesn't work and you can't get any image metadata.
      image.crop(0.0, 0.0, 1.0, 1.0)
      image.execute_transforms(output_encoding=images.JPEG, quality=1)
      if image.width > 640 or image.height > 640:
        raise ServerError('Image must be 640x640 or smaller')
      resultUrl = images.get_serving_url(uploads[0].key())
      self.response.headers['Content-Type'] = 'application/json'
      self.response.body = jsonEncode({'status': 0, 'imageUrl': resultUrl})
    except Exception as e:
      # Delete every uploaded blob on failure.
      for upload in uploads:
        blobstore.delete(upload.key())  # TODO: delete in parallel with delete_async
      self.response.headers['Content-Type'] = 'text/plain'
      self.response.status = 403
      self.response.body = str(e)

Comments in the code highlight the issue.

I know the image can be resized on the fly at serve time (using get_serving_url), but I'd rather force users to upload a smaller image in the first place, to avoid using up storage. Later, instead of putting a limit on the original image dimensions, I might want to have it shrunk automatically at upload time, but I'd still need to find out its dimensions and type before shrinking it.
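For the record, a sketch of that serve-time resizing (blob_key stands in for the uploaded blob's key; 640 is just the limit from above):

from google.appengine.api import images

# get_serving_url returns a URL that serves the image resized on the
# fly; size bounds the longest dimension, crop=False keeps the aspect.
serving_url = images.get_serving_url(blob_key, size=640, crop=False)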

Am I missing an easier or more efficient way?

Upvotes: 0

Views: 391

Answers (2)

voscausa

Reputation: 11706

When you upload to Google Cloud Storage (GCS) instead of the Blobstore, you have much more control over upload conditions such as the object's name, type, and size. A policy document specifies the conditions an upload must meet; if the upload does not meet them, the object is rejected.

Docs here.

Example:

{"expiration": "2010-06-16T11:11:11Z",
 "conditions": [
  ["starts-with", "$key", "" ],
  {"acl": "bucket-owner-read" },
  {"bucket": "travel-maps"},
  {"success_action_redirect":"http://www.example.com/success_notification.html" },
  ["eq", "$Content-Type", "image/jpeg" ],
  ["content-length-range", 0, 1000000]
  ]
}
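The policy document must be base64-encoded and signed before it goes into the POST form. On App Engine this can be done with the app_identity API; a minimal sketch, assuming policy_dict is a dict like the one above (the returned field names follow the GCS XML API form conventions):

import base64
import json

from google.appengine.api import app_identity

def signed_policy_fields(policy_dict):
    # Base64-encode the policy document and sign it with the app's
    # default service account.
    policy = base64.b64encode(json.dumps(policy_dict))
    _, signature = app_identity.sign_blob(policy)
    return {
        'policy': policy,
        'signature': base64.b64encode(signature),
        'GoogleAccessId': app_identity.get_service_account_name(),
    }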

The POST response if the content length was exceeded:

<Error>
    <Code>EntityTooLarge</Code>
    <Message>
        Your proposed upload exceeds the maximum allowed object size.
    </Message>
    <Details>Content-length exceeds upper bound on range</Details>
</Error>

The POST response if a PDF was sent:

<Error>
    <Code>InvalidPolicyDocument</Code>
    <Message>
        The content of the form does not meet the conditions specified in the policy document.
    </Message>
    <Details>Policy did not reference these fields: filename</Details>
</Error>

And here you can find my Python code for a direct upload to GCS.

Upvotes: 1

Dan Cornilescu

Reputation: 39824

Actually, the Blobstore is not exactly optimized for serving images; it operates on any kind of data. The BlobReader class can be used to read the raw blob data.

The GAE Images service can be used to manage images (including those stored as blobs in the Blobstore). You are right in the sense that this service offers info about an uploaded image only after executing a transformation on it, which doesn't help with deleting undesirable blobs before processing.

What you can do is use the Image module from the PIL library (available among GAE's runtime-provided libraries), overlaid on top of the BlobReader class.

PIL's Image.format and Image.size attributes give you the info you seek, and let you sanitize the image data before reading the entire file:

>>> image = Image.open('Spain-rail-map.jpg')
>>> image.format
'JPEG'
>>> image.size
(410, 317)

These lookups should be very efficient, since they only need the image header info that the open method loads from the blob:

Opens and identifies the given image file. This is a lazy operation; the function reads the file header, but the actual image data is not read from the file until you try to process the data (call the load method to force loading).

This is how the overlay can be done in your ImageUploadHandler:

from PIL import Image

with blobstore.BlobReader(uploads[0].key()) as fd:
    image = Image.open(fd)  # lazy: only the header is read here
    logging.error('format=%s' % image.format)
    logging.error('size=%dx%d' % image.size)
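Building on this, a sketch of how the question's checks could be done with PIL before any transform (assuming the ServerError class and the 640-pixel limit from the question):

from PIL import Image
from google.appengine.ext import blobstore

ALLOWED_FORMATS = ('JPEG', 'GIF', 'PNG')  # formats the question allows

with blobstore.BlobReader(uploads[0].key()) as fd:
    image = Image.open(fd)  # lazy: reads only the image header
    if image.format not in ALLOWED_FORMATS:
        raise ServerError('Image must be JPEG, GIF, or PNG')
    width, height = image.size
    if width > 640 or height > 640:
        raise ServerError('Image must be 640x640 or smaller')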

Upvotes: 2
