Mick
Mick

Reputation: 11

Django bulk_upsert saves queries to memory causing memory errors

For our Django web server we have quite limited resources which means we have to be careful with the amount of memory we use. One part of our web server is a crom job (using celery and rabbitmq) that parses a ~130MB csv file into our postgres database. The csv file is saved to disk and then read using the csv module from python, reading row by row. Because the csv file is basically a feed, we use the bulk_upsert from the custom postgres manager from django-postgres-extra to upsert our data and override existing entries. Recently we started experiencing memory errors and we eventually found out they were caused by Django.

Running mem_top() showed us that Django was storing massive upsert queries(INSERT ... ON CONFLICT DO) including their metadata, in memory. Each bulk_upsert of 15000 rows would add 40MB memory used by python, leading to a total of 1GB memory used when the job would finish as we upsert 750.000 rows in total. Apparently Django does not release the query from memory after it's finished. Running the crom job without the upsert call would lead to a max memory usage of 80MB, of which 60MB is default for celery.

We tried running gc.collect() and django.db.reset_queries() but the queries are still stored in memory. Our Debug setting is set to false and CONN_MAX_AGE is also not set. Currently we're out clues for where to look to fix this issue, we can't run our crom jobs now. Do you know of any last resorts to try to resolve this issue?

Some more meta info regarding our server:

django==2.1.3
django-elasticsearch-dsl==0.5.1
elasticsearch-dsl==6.1.0
psycopg2-binary==2.7.5
gunicorn==19.9.0
celery==4.3.0
django-celery-beat==1.5.0
django-postgres-extra==1.22

Thank you very much in advance!

Upvotes: 0

Views: 475

Answers (1)

Mick
Mick

Reputation: 11

Today I've found the solutions for our issues so I thought it'd be great to share. It turned out that the issue was a combination of Django and Sentry (which we only use on our production server). Django would log the query and Sentry would then catch this log and keep it in memory for some reason. As each raw SQL query was about 40MB this ate a lot of memory. Currently, we turned Sentry off on our crom job server and are looking into a way to clear the logs kept by sentry.

Upvotes: 1

Related Questions