Reputation: 3
I am trying to use the following Spark libraries in my AWS Lambda:
implementation "org.apache.spark:spark-core_2.12:2.4.6"
implementation "org.apache.spark:spark-sql_2.12:2.4.6"
I initially ran the Lambda with 576 MB of memory and then with 1024 MB. Both times it failed with:
Metaspace: java.lang.OutOfMemoryError java.lang.OutOfMemoryError: Metaspace
Exception in thread "main" java.lang.Error: java.lang.OutOfMemoryError: Metaspace
at lambdainternal.AWSLambda.<clinit>(AWSLambda.java:65)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at lambdainternal.LambdaRTEntry.main(LambdaRTEntry.java:150)
Caused by: java.lang.OutOfMemoryError: Metaspace
Exception in thread "Thread-3" java.lang.OutOfMemoryError: Metaspace
It ran successfully when run with a memory size of 2048 MB.
I would like to know the actual memory size needed to use Spark in AWS Lambda. Is there a lighter version of the library? I am using it to create a Parquet file and upload it to S3.
Thanks.
Upvotes: 0
Views: 1143
Reputation: 19328
You definitely don't want to include Spark as a dependency in a Lambda function. Spark is way too heavy for Lambda; it should run on a cluster, and Lambda isn't a cluster.
If you want to run serverless Spark code, check out AWS Glue... or don't, because AWS Glue is relatively complicated to use.
If your file is small enough to be converted to Parquet inside a Lambda function, check out AWS Data Wrangler. The releases contain pre-built Lambda layers, so you don't need to worry about the low-level details of building layers yourself (figuring out numpy & PyArrow is really annoying - just use the lib).
Here's the code that writes out a Parquet file:
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# Store the DataFrame as a Parquet dataset on S3 (data lake) and register it
# as my_db.my_table in the Glue catalog
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table",
)
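If you only need a plain Parquet file in S3 and don't care about registering a table in the Glue catalog, you should be able to drop the dataset/database/table arguments; a minimal sketch along the same lines (the bucket and key are placeholders):

# Write a single Parquet file to S3 without touching the Glue catalog
wr.s3.to_parquet(df=df, path="s3://bucket/dataset/my_file.parquet")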
Upvotes: 1
Reputation: 2208
The amount of memory you allocate to your Java Lambda function is shared by the heap, the metaspace, and reserved code memory.
You can consider increasing only -XX:MaxMetaspaceSize, because your exception log (java.lang.OutOfMemoryError: Metaspace) shows the issue is in the metaspace.
You can tune this by increasing only the metaspace without changing the heap and buffer space. (Note: Spark is probably loading a lot of classes and filling up the metaspace.) Please also consider running your Spark app in cluster mode instead.
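For example, you can't pass JVM flags on a command line in Lambda, but the newer Java runtimes (java11 and java8.al2) honor the JAVA_TOOL_OPTIONS environment variable, so a rough sketch would be to set something like the following on the function (the 512m value is only an illustration and still has to fit inside the memory you allocate to the function; check whether your runtime supports this variable):

JAVA_TOOL_OPTIONS: -XX:MaxMetaspaceSize=512m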
You can check this thread for more info about heap memory, metaspace, and reserved code memory.
Upvotes: 1