Pranav Arora

Reputation: 41

Memory Leak - After every request hit on Flask API running in a container

I have a Flask app running in a container on EC2. When the container starts, docker stats reports memory usage close to 48MB. After the first API call (which reads a 2GB file from S3), usage rises to 5.72GB, and it does not go back down even after the call completes.

Each request raises memory usage by roughly twice the file size, and after a few requests the server starts throwing memory errors.

Running the same Flask app outside the container, we do not see any such growth in memory usage.

Output of "docker stats <container_id>" before hitting the API-

Output of "docker stats <container_id>" after hitting the API

Flask app (app.py) contains:

import os
import json
import pandas as pd
import flask

app = flask.Flask(__name__)


@app.route('/uploadData', methods=['POST'])
def test():
    json_input = flask.request.args.to_dict()
    s3_path = json_input['s3_path']
    # reading file directly from s3 - without downloading
    df = pd.read_csv(s3_path)
    print(df.head(5))
    
    #clearing df
    df = None
    return json_input

@app.route('/healthcheck', methods=['GET'])
def HealthCheck():
    return "Success"

if __name__ == '__main__':
    app.run(host="0.0.0.0", port='8898')

Dockerfile contains:

FROM python:3.7.10

RUN apt-get update -y && apt-get install -y python-dev

# Copy the application code into the image
COPY . /app_abhi
WORKDIR /app_abhi

EXPOSE 8898

RUN pip3 install flask boto3 pandas fsspec s3fs

CMD [ "python","-u", "app.py" ]

I tried reading the file directly from S3 as well as downloading it first and then reading it, but the memory behaviour was the same in both cases.

Any leads on getting this memory utilization back down to the initial consumption would be a great help!

Upvotes: 4

Views: 6677

Answers (3)

Maxwell86

Reputation: 133

I had a similar problem (see the question "Google Cloud Run: script requires little memory, yet reaches memory limit").

Finally, I was able to solve it by adding:

import gc
...
gc.collect()
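
For context, here is a minimal sketch of where that call could go in the handler from the question (the route and variable names are taken from the question's code; how much of the freed memory actually shows up as a drop in docker stats depends on how much the allocator hands back to the OS):

import gc

import flask
import pandas as pd

app = flask.Flask(__name__)

@app.route('/uploadData', methods=['POST'])
def test():
    json_input = flask.request.args.to_dict()
    s3_path = json_input['s3_path']
    df = pd.read_csv(s3_path)
    print(df.head(5))
    # Drop the only reference and force a collection before returning,
    # so the DataFrame's memory is freed inside the process right away.
    del df
    gc.collect()
    return json_input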

Upvotes: 0

Sonal Bansal

Reputation: 9

You can try the following possible solutions (a sketch of the first two appears after this list):

  1. Specify the dtypes of the columns: pandas by default tries to infer the dtype of each column when it creates a DataFrame, and some inferred types result in large memory allocations. You can reduce this by declaring narrower dtypes for such columns, e.g. np.int8 for small integer columns and np.float16 for float columns. Refer to this: Pandas/Python memory spike while reading 3.2 GB file

  2. Read the data in chunks: read the file one chunk at a time, perform the required processing on that chunk, and then move on to the next one, so the entire dataset is never held in memory at once. Reading in chunks can be slower than reading everything at once, but it is memory efficient.

  3. Try a different library: Dask DataFrame is used in situations where pandas is commonly needed, usually when pandas fails due to data size or computation speed. You might not find all of the built-in pandas operations in Dask, though. https://docs.dask.org/en/latest/dataframe.html
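
A rough sketch of points 1 and 2 with pandas (the S3 path, column names, and dtypes below are made up for illustration; substitute the real ones):

import numpy as np
import pandas as pd

s3_path = 's3://my-bucket/big-file.csv'  # hypothetical path

# 1. Declare narrow dtypes up front instead of letting pandas infer them.
dtypes = {'user_id': np.int32, 'score': np.float32}  # example columns

# 2. Stream the file in chunks so only one chunk is in memory at a time.
total_rows = 0
for chunk in pd.read_csv(s3_path, dtype=dtypes, chunksize=100_000):
    total_rows += len(chunk)  # replace with the real per-chunk processing

print(total_rows)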

Upvotes: 1

Dave W. Smith

Reputation: 24966

The memory growth is almost certainly caused by constructing the dataframe.

Setting df = None doesn't return that memory to the operating system, though it does return it to the heap managed within the process. There's an explanation of this in "How do I release memory used by a pandas dataframe?"
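
A minimal sketch of one glibc-specific way to push freed heap pages back to the OS (this assumes a glibc-based image such as python:3.7.10; malloc_trim is a glibc call and is not available on other allocators):

import ctypes
import gc

def release_memory():
    # Free unreachable Python objects first...
    gc.collect()
    # ...then ask glibc to return free heap pages to the operating system.
    ctypes.CDLL('libc.so.6').malloc_trim(0)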

Upvotes: 0
