Reputation: 41
I have a Flask app running in a container on EC2. On starting the container, docker stats showed memory usage close to 48MB. After making the first API call (reading a 2GB file from S3), the usage rises to 5.72GB. Even after the API call completes, the usage does not go down.
On each request, the usage goes up by roughly twice the file size, and after a few requests the server starts throwing memory errors.
Also, when running the same Flask app outside a container, we do not see any such increase in memory usage.
Output of "docker stats <container_id>" before hitting the API-
Output of "docker stats <container_id>" after hitting the API
Flask app (app.py) contains-
import os
import json
import pandas as pd
import flask

app = flask.Flask(__name__)


@app.route('/uploadData', methods=['POST'])
def test():
    json_input = flask.request.args.to_dict()
    s3_path = json_input['s3_path']

    # reading file directly from s3 - without downloading
    df = pd.read_csv(s3_path)
    print(df.head(5))

    # clearing df
    df = None
    return json_input


@app.route('/healthcheck', methods=['GET'])
def HealthCheck():
    return "Success"


if __name__ == '__main__':
    app.run(host="0.0.0.0", port='8898')
Dockerfile contains:
FROM python:3.7.10
RUN apt-get update -y && apt-get install -y python-dev
# We copy just the requirements.txt first to leverage Docker cache
COPY . /app_abhi
WORKDIR /app_abhi
EXPOSE 8898
RUN pip3 install flask boto3 pandas fsspec s3fs
CMD [ "python","-u", "app.py" ]
I tried reading the file directly from S3 as well as downloading the file first and then reading it, but neither approach helped.
Any leads on getting this memory utilization back down to the initial consumption would be a great help!
Upvotes: 4
Views: 6677
Reputation: 133
I had a similar problem (see question Google Cloud Run: script requires little memory, yet reaches memory limit)
Finally, I was able to solve it by adding:
import gc
...
gc.collect()
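Applied to the route from the question, the placement could look roughly like this (a minimal sketch; the explicit del before gc.collect() is my own addition, not part of the original answer):
import gc

import flask
import pandas as pd

app = flask.Flask(__name__)

@app.route('/uploadData', methods=['POST'])
def test():
    json_input = flask.request.args.to_dict()
    df = pd.read_csv(json_input['s3_path'])
    print(df.head(5))
    # Drop the only reference and ask the collector to clean up any
    # reference cycles before the request returns.
    del df
    gc.collect()
    return json_input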
Upvotes: 0
Reputation: 9
You can try the following possible solutions:
Update the dtype of the columns: pandas by default tries to infer the data type of each column when it creates a DataFrame, and some inferred types lead to large memory allocations. You can reduce this by specifying narrower dtypes for such columns, e.g. 8-bit integers for small integer columns and 16-bit floats for float columns (note that pd.np is deprecated; use numpy dtypes or dtype strings instead). Refer to this: Pandas/Python memory spike while reading 3.2 GB file
Read data in chunks: You can read the data in chunks of a given size, perform the required processing on each chunk, and then move on to the next one. This way the entire dataset is never held in memory at once. Reading in chunks can be slower than reading everything in one go, but it is memory efficient (a sketch combining this and the dtype suggestion follows this list).
Try a different library: Dask DataFrame is used in situations where pandas is commonly needed, usually when pandas fails due to data size or computation speed, though not every built-in pandas operation is available in Dask. https://docs.dask.org/en/latest/dataframe.html
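A minimal sketch of the dtype and chunked-reading ideas, assuming a hypothetical S3 path and column names (user_id, score) that would need to be replaced with the real schema:
import pandas as pd

s3_path = "s3://my-bucket/big-file.csv"  # hypothetical path

# Hypothetical column names; replace with the real schema of the CSV.
dtypes = {"user_id": "int32", "score": "float32"}

row_count = 0
# chunksize makes read_csv return an iterator of DataFrames, so the
# whole 2GB file is never held in memory at once.
for chunk in pd.read_csv(s3_path, dtype=dtypes, chunksize=100_000):
    row_count += len(chunk)  # do the real per-chunk processing here
print(row_count)

# Dask alternative (lazy, out-of-core):
# import dask.dataframe as dd
# df = dd.read_csv(s3_path)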
Upvotes: 1
Reputation: 24966
The memory growth is almost certainly caused by constructing the dataframe.
df = None
doesn't return that memory to the operating system, though it does return memory to the heap managed within the process. There's an explanation for that in How do I release memory used by a pandas dataframe?
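A common workaround (not from the linked answer, and the function names here are my own for illustration) is to do the pandas work in a short-lived worker process; when that process exits, its memory is returned to the operating system. A rough sketch:
import multiprocessing

import pandas as pd

def summarize(s3_path, queue):
    # Runs in a child process; everything allocated here is freed back
    # to the OS when the process exits.
    df = pd.read_csv(s3_path)
    queue.put(len(df))

def read_in_worker(s3_path):
    queue = multiprocessing.Queue()
    worker = multiprocessing.Process(target=summarize, args=(s3_path, queue))
    worker.start()
    result = queue.get()
    worker.join()
    return result
Spawning a process per request adds latency, so this only makes sense for heavy, infrequent calls like the one in the question.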
Upvotes: 0