Reputation: 3
I am trying to use the following Spark libraries in my AWS Lambda:
implementation "org.apache.spark:spark-core_2.12:2.4.6"
implementation "org.apache.spark:spark-sql_2.12:2.4.6"
I initially ran the Lambda with 576 MB of memory and then with 1024 MB. Both times it failed with:
Metaspace: java.lang.OutOfMemoryError java.lang.OutOfMemoryError: Metaspace
Exception in thread "main" java.lang.Error: java.lang.OutOfMemoryError: Metaspace
at lambdainternal.AWSLambda.<clinit>(AWSLambda.java:65)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at lambdainternal.LambdaRTEntry.main(LambdaRTEntry.java:150)
Caused by: java.lang.OutOfMemoryError: Metaspace
Exception in thread "Thread-3" java.lang.OutOfMemoryError: Metaspace
It ran successfully when run with a memory size of 2048 MB.
I would like to know the actual memory size needed to use Spark in AWS Lambda. Is there a lighter version of the library? I am using it to create a Parquet file and upload it to S3.
Thanks.
Upvotes: 0
Views: 1143
Reputation: 19328
You definitely don't want to include Spark as a dependency in a Lambda function. Spark is way too heavy for Lambda; it should run on a cluster, and Lambda isn't a cluster.
If you want to run serverless Spark code, check out AWS Glue... or don't, because AWS Glue is relatively complicated to use.
If your file is small enough to be converted to Parquet inside a Lambda function, check out AWS Data Wrangler. The releases contain pre-built Lambda layers, so you don't need to worry about the low-level details of building layers yourself (figuring out numpy & PyArrow is really annoying - just use the lib).
Here's the code that writes out a Parquet file:
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# Store the DataFrame as a Parquet dataset on S3 (data lake) and register it
# as my_db.my_table in the Glue catalog
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table",
)
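If you only need a plain Parquet file in S3 and don't care about registering a table in the Glue catalog, you should be able to drop the dataset/database/table arguments; a minimal sketch along the same lines (the bucket and key are placeholders):

# Write a single Parquet file to S3 without touching the Glue catalog
wr.s3.to_parquet(df=df, path="s3://bucket/dataset/my_file.parquet")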
Upvotes: 1
Reputation: 2208
The amount of memory you allocate to your Java Lambda function is shared by the heap, the metaspace, and reserved code memory.
You can consider increasing only -XX:MaxMetaspaceSize, because your exception log (java.lang.OutOfMemoryError: Metaspace) shows the issue is in the metaspace.
You can tune this by increasing only the metaspace without changing the heap and buffer space. (Note: Spark is probably loading a lot of classes and filling up the metaspace.) Please also consider running your Spark app in cluster mode instead.
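For example, you can't pass JVM flags on a command line in Lambda, but the newer Java runtimes (java11 and java8.al2) honor the JAVA_TOOL_OPTIONS environment variable, so a rough sketch would be to set something like the following on the function (the 512m value is only an illustration and still has to fit inside the memory you allocate to the function; check whether your runtime supports this variable):

JAVA_TOOL_OPTIONS: -XX:MaxMetaspaceSize=512m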
You can check this thread for more info about heap memory, metaspace, and reserved code memory.
Upvotes: 1