en-core-web-sm module error - Serverless deployment AWS Lambda

I am using spaCy's en-core-web-sm model in my Python AWS Lambda. I ran pip freeze > requirements.txt to write all the dependencies to requirements.txt, and en-core-web-sm==2.1.0 is one of the lines in that file.

When I try to make a serverless deployment, I get:

ERROR: Could not find a version that satisfies the requirement en-core-web-sm==2.1.0 (from versions: none)
ERROR: No matching distribution found for en-core-web-sm==2.1.0

Even though I am not using Heroku, I followed the answer to "Heroku Deployment Error: No matching distribution found for en-core-web-sm" and added this line to my requirements.txt file:

https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0

That only produced a different error:

Unzipped size must be smaller than 262144000 bytes (Service: AWSLambdaInternal; Status Code: 400; Error Code: InvalidParameterValueException; Request ID: XxX-XxX)

How can I wire up en-core-web-sm to my Lambda?

Upvotes: 2

Views: 2361

Answers (1)

Chandan Gupta

Reputation: 722

Take advantage of the model being a separate component from the library: I uploaded the model to an S3 bucket, and I download it from S3 before initialising spaCy. The download is done by the function below.

import os
import boto3

def download_dir(dist, local, bucket):
    """Recursively download every object under the S3 prefix dist into local."""
    # A plain boto3 client; swap in your own cached client if you have one.
    client = boto3.client('s3')

    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        # Each common prefix is a "subdirectory" below dist; recurse into it.
        for subdir in result.get('CommonPrefixes') or []:
            download_dir(subdir.get('Prefix'), local, bucket)
        for obj in result.get('Contents') or []:
            dest_path = local + os.sep + obj.get('Key')
            # Keys ending in '/' are folder placeholders, not files.
            if dest_path.endswith('/'):
                continue
            os.makedirs(os.path.dirname(dest_path), exist_ok=True)
            client.download_file(bucket, obj.get('Key'), dest_path)

And the code using spaCy looks like this:

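import os
import spacy

# lang is the S3 prefix of the model, mapping_bucket is the bucket name and
# spacy_input is the text to process; all three are defined elsewhere.
# /tmp is the writable directory in Lambda; skip the download if the model is already there.
if not os.path.isdir('/tmp/en_core_web_sm-2.0.0'):
    download_dir(lang, '/tmp', mapping_bucket)
spacy.util.set_data_path('/tmp')

nlp = spacy.load('/tmp/en_core_web_sm-2.0.0')
doc = nlp(spacy_input)
for token in doc:
    # pos_ is the part-of-speech tag, dep_ the syntactic dependency label.
    print(token.text, token.pos_, token.dep_)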
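
Since a warm Lambda container keeps module-level state between invocations, it may be worth loading the model only once per container rather than on every call. A minimal sketch of that idea follows; get_model, _MODEL_CACHE and the download/load callables are hypothetical names for illustration, not part of spaCy or boto3.

```python
import os

# Module-level cache: survives across invocations in the same warm container.
_MODEL_CACHE = {}

def get_model(model_dir, download, load):
    """Return the model for model_dir, downloading and loading it at most once.

    download(model_dir) must place the model files at model_dir;
    load(model_dir) must return the loaded model (for example spacy.load).
    """
    if model_dir not in _MODEL_CACHE:
        # Only fetch from storage when the files are not already on disk.
        if not os.path.isdir(model_dir):
            download(model_dir)
        _MODEL_CACHE[model_dir] = load(model_dir)
    return _MODEL_CACHE[model_dir]
```

With the code above, this could be called as get_model('/tmp/en_core_web_sm-2.0.0', lambda path: download_dir(lang, '/tmp', mapping_bucket), spacy.load), so that only the first invocation in a container pays for the download and the load.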

Upvotes: 4
