Patch

Reputation: 754

Passing a large file to Celery for processing isn't working

I want to save a file to AWS S3, and I am using Celery because I don't want to wait until the function finishes writing the file. The problem is that when I send the file through a Celery task, the object that ends up in my AWS file storage is not the same size as the actual file.

This is where I send it to the Celery task:

file_to_put = str(file_to_put)  # because you can't send an object to a Celery function
write_file_aws.delay(file_full_name, file_to_put)

The Celery task itself:

@celery.task(name="write_file_to_aws")
def write_file_aws(file_full_name, file_to_put):
    file_to_put = bytearray(file_to_put)
    s3 = boto3.resource('s3')
    s3.Object(BUCKET, file_full_name).put(Body=file_to_put)
    return "Request sent!"

The resulting file size is wrong (e.g. 1KB instead of 22KB; with pictures it's even 710KB instead of 230KB), and the file itself is just gibberish. Why would this happen? Is it because I converted it to a string? If so, what else can I do?
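The symptoms are consistent with the `str()` round trip mangling the bytes: in Python 3, `str()` on a bytes object produces its textual repr (including the `b'...'` wrapper and escape sequences), not the raw data. A minimal demonstration in plain Python, no Celery involved (the sample bytes are arbitrary):

```python
data = b"\x89PNG\r\n"          # sample binary header bytes
as_str = str(data)             # yields the repr "b'\\x89PNG\\r\\n'", not the raw bytes
recovered = bytearray(as_str, "utf-8")

print(len(data), len(recovered))   # 6 vs 14: the sizes no longer match
print(bytes(recovered) == data)    # False: the content is gibberish relative to the original
```

This is why the uploaded object's size differs from the source file and why its contents look garbled.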

Upvotes: 1

Views: 2718

Answers (2)

Nitin Nain

Reputation: 5483

You're serializing a large file and passing it as an argument to the task. I assume you're running on EC2, so you could instead store the file on the EC2 instance's instance store or an EBS volume first (both are faster to write to than S3), then pass the *path* to this file as the argument to the Celery task. The Celery worker then copies the file to S3.

i.e. this:

def write_file_aws(file_full_name, file_to_put):

will become:

def write_file_aws(file_full_name, path_to_local_file):

Here's a primer on AWS EC2 storage options: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Storage.html

Upvotes: 1

DejanLekic

Reputation: 19797

For valid reasons (in short: task arguments travel through the broker and may be held in memory, so a large object can cause memory errors), you can't pass large objects to your Celery tasks. Instead, pass a reference to wherever the Celery task can access the large object. If it is a file, put it on a shared filesystem (NFS, for example) accessible by all Celery nodes, and pass the file name (and a path, if that makes it easier for you).

Upvotes: 1
