GodBlessYou

Reputation: 639

Django bulk_create ignore_conflicts=True leaks memory

We are using Django 2.2, Python 3.6, and MySQL 5.6 for scheduling data-intensive jobs.

Memory increases over time during a long-running job, even with DEBUG=False in settings.py.

Why use ignore_conflicts?

We set up a unique key on the table, so ignore_conflicts filters out records that are already in the table.
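At the SQL level, ignore_conflicts on MySQL amounts to INSERT IGNORE: rows that violate the unique key are silently skipped. A standalone sketch of the same behavior using SQLite's analogue (INSERT OR IGNORE), with a hypothetical product table, since it runs without a MySQL server:

```python
import sqlite3

# Hypothetical table standing in for the Product model:
# "sku" carries the unique key that ignore_conflicts relies on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (sku TEXT UNIQUE, name TEXT)")

rows = [("A1", "widget"), ("A1", "widget-dup"), ("B2", "gadget")]

# INSERT OR IGNORE skips the row with the duplicate sku "A1"
# instead of raising an IntegrityError, just as bulk_create
# with ignore_conflicts=True does.
conn.executemany(
    "INSERT OR IGNORE INTO product (sku, name) VALUES (?, ?)", rows
)

count = conn.execute("SELECT COUNT(*) FROM product").fetchone()[0]
print(count)  # → 2
```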

The simplified code looks like this:

for record_batch in readFromSomewhere(batch_size):
    product_list = []
    for record in record_batch:
        product = parse(record)
        product_list.append(product)

    # memory increases and leaks
    Product.objects.bulk_create(product_list, ignore_conflicts=True)

    # memory does not increase
    #Product.objects.bulk_create(product_list)

    # neither of these helped
    #db.reset_queries()
    #gc.collect()

I read a lot of Stack Overflow posts and added gc.collect() and django.db.reset_queries(), but neither prevents the growth. If I use Product.objects.bulk_create(product_list), memory does not increase; but with Product.objects.bulk_create(product_list, ignore_conflicts=True), memory grows over time.

The batch size is very small, around 100. I noticed that a smaller batch size (meaning more bulk_create calls) makes memory grow faster, while a larger batch size makes it grow more slowly.

Any thoughts on how to release the memory after each batch is created (with ignore_conflicts=True)?

Upvotes: 3

Views: 1958

Answers (1)

GodBlessYou

Reputation: 639

The root cause was found by digging into memory usage. We were using the mysqlclient==1.3.14 package, which performs warning checks after each statement; those warning messages were saved in memory and never garbage collected. With ignore_conflicts=True, MySQL emits a warning for every skipped duplicate row, so the warnings accumulated over the life of the job.

The warning checks were removed from the package entirely in a later release, so after upgrading to mysqlclient==1.4.4, memory usage became stable.

Upvotes: 4
