Below the Radar

Reputation: 7635

Django - bulk_create() leads to memory error

I have around 400 000 object instances to insert into Postgres. I am using bulk_create() to do so, but I hit a MemoryError.

My first idea was to chunk the list of instances:

def chunks(l, n):
    """Split list l into chunks of at most n items."""
    n = max(1, n)
    return [l[i:i + n] for i in range(0, len(l), n)]

for c in chunks(instances, 1000):
    Feature.objects.bulk_create(c)

But sometimes that strategy also leads to a MemoryError, because instance sizes can vary a lot, so one chunk can exceed the memory limit while others don't.

Is it possible to chunk the list of instances so that each chunk stays within a bounded size? What would be the best approach in this case?

Upvotes: 12

Views: 13705

Answers (6)

blokeish

Reputation: 601

I just wanted to share a slightly different approach which may help someone else. The first thing to do to reduce memory usage is NOT to load the entire dataset into memory. If the data comes from a file, just read a fixed batch of records from the file, build the object list, and write that to the database. Then clear the list and build the next batch. This way the entire dataset is never loaded into memory at once.
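
For illustration, here is a minimal sketch of that file-based batching. The CSV layout, the helper name and the Feature field names are assumptions for the example, not from the original post:

import csv

from my_app.models import Feature  # model name taken from the question; import path is an assumption

def bulk_create_from_file(path, batch_size=1000):
    """Read rows in batches so the whole file never sits in memory."""
    batch = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Field names are hypothetical; adapt them to your model.
            batch.append(Feature(name=row["name"], value=row["value"]))
            if len(batch) >= batch_size:
                Feature.objects.bulk_create(batch)
                batch = []  # drop references so the written objects can be garbage collected
    if batch:  # flush the remaining partial batch
        Feature.objects.bulk_create(batch)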

In my case I had the entire dataset loaded into memory (a list), since it wasn't too large and I needed to do something else with it. So I decided to extract a subset (batch) of the list and write it to the database one batch at a time.

# MyModelObjectList is the list variable with all the data model objects
# batch_size is the number of records you want to write to DB in one transaction
def my_bulk_create(MyModel, MyModelObjectList, batch_size=1000):
    for x in range(0, len(MyModelObjectList), batch_size):
        MyModel.objects.bulk_create(MyModelObjectList[ x: x+batch_size ]) # Extracting a subset of the list

You could use islice() (from itertools) instead:

from itertools import islice

def my_bulk_create(MyModel, MyModelObjectList, batch_size=1000):
    for x in range(0, len(MyModelObjectList), batch_size):
        MyModel.objects.bulk_create(list(islice(MyModelObjectList, x, x + batch_size)))

I guess this variant also removes the potential infinite loop of the while True / islice pattern when the input is a plain list (islice on a list always starts from the beginning), since it advances the start position explicitly and stops once a batch comes back empty:

from itertools import islice

def my_bulk_create(MyModel, MyModelObjectList, batch_size=1000):
    batchStart = 0
    while True:
        batch = list(islice(MyModelObjectList, batchStart, batchStart + batch_size))
        if not batch:
            break
        batchStart += batch_size  # New start position for the next iteration
        MyModel.objects.bulk_create(batch)

NB: I have not personally executed the code snippets provided here, except the first one, so you may have to debug them a bit.

Upvotes: 0

Alexandr S.

Reputation: 1774

Maybe it'll be helpful for someone; here is an example of using generators + batch_size in Django:

from itertools import islice
from my_app.models import MyModel

def create_data(data):
    bulk_create(MyModel, generator(data))

def bulk_create(model, generator, batch_size=10000):
    """
    Uses islice to call bulk_create on batches of
    Model objects from a generator.
    """
    while True:
        items = list(islice(generator, batch_size))
        if not items:
            break
        model.objects.bulk_create(items)

def generator(data):
    for row in data:
        yield MyModel(field1=row['field1'])

The original article was here - https://concisecoder.io/2019/04/19/avoid-memory-issues-with-djangos-bulk_create/

Upvotes: 1

Du D.

Reputation: 5310

You can specify the batch_size argument of the bulk_create() method; Django will then split the insert into multiple queries of at most batch_size objects each.

Syntax: bulk_create(objs, batch_size=None)
Feature.objects.bulk_create(instances, batch_size=1000)

Django 2.2: https://docs.djangoproject.com/en/2.2/ref/models/querysets/#bulk-create

Django 3.1: https://docs.djangoproject.com/en/3.1/ref/models/querysets/#bulk-create

Upvotes: 12

linqu

Reputation: 11970

If you are running Django in debug mode it will keep track of all your SQL statements for debugging purposes. With many objects this can cause memory problems. You can reset that with:

from django import db
db.reset_queries()

see why-is-django-leaking-memory
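
For example, combined with the chunked loop from the question, you could clear the query log after each batch. This is only a sketch (chunks(), instances and Feature come from the question), and it only matters when DEBUG = True:

from django import db

for c in chunks(instances, 1000):
    Feature.objects.bulk_create(c)
    db.reset_queries()  # drop the logged SQL so it doesn't accumulate in debug mode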

Upvotes: 16

Valar

Reputation: 2033

If you are not running in DEBUG mode and you still get errors, my solution should help you. First, ensure that you have a lazily generated set of objects to be saved (for example, fetched from a remote API in batches):

def generate_data():
    """Example data generator"""
    for i in range(100000):
        yield Model(counter=i)

data_gen = generate_data()
# >>> print(data_gen)
# <generator object data at 0x7f057591c5c8>
# 
# it's a generator, objects are not yet created. 
# You can iterate it one-by-one or force generation using list(data_gen)
# But for our approach, we need generator version

Next, we need a function that takes at most X objects from that generator at a time and saves them using bulk_create. This way, at any single moment we hold no more than X objects in memory.

from itertools import islice

def bulk_create_iter(iterable, batch_size=10000):
    """Bulk create supporting generators. Returns only count of created objects."""
    created = 0
    while True:
        objects = Model.objects.bulk_create(islice(iterable, batch_size))
        created += len(objects)
        if not objects:
            break
    return created

and use it like this

print(bulk_create_iter(data_gen))
# prints 100000

The reason we can't just use bulk_create directly is that internally it does list(objs), so the whole generator would be instantiated and held in memory. With this approach, we instantiate at most batch_size objects at a time. This method can be used to process even very large sets, as memory consumption should stay constant (tested with 15 000 000 records; memory usage was under 300MB the whole time).

A ready-to-use, generic version of this function, as a method of a Django Manager class (you can use it in your model by writing objects = BulkManager()):

from itertools import islice
from django.db import models

class BulkManager(models.Manager):

    def bulk_create_iter(self, iterable, batch_size=10000):
        """Bulk create supporting generators, returns number of created objects."""
        created = 0
        while True:
            objects = self.bulk_create(islice(iterable, batch_size))
            created += len(objects)
            if not objects:
                break
        return created
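
A usage sketch, assuming a hypothetical MyModel with a single counter field (names chosen only for illustration):

from django.db import models

class MyModel(models.Model):
    counter = models.IntegerField()

    objects = BulkManager()  # attach the custom manager

# Any generator can then be passed straight to the manager:
created = MyModel.objects.bulk_create_iter(
    (MyModel(counter=i) for i in range(100000)), batch_size=5000
)
print(created)  # 100000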

Upvotes: 6

ramusus

Reputation: 8315

I faced the same problem and ended up with this solution:

class BulkCreateManager(object):

    model = None
    chunk_size = None
    instances = None

    def __init__(self, model, chunk_size=None, *args):
        self.model = model
        self.chunk_size = chunk_size
        self.instances = []

    def append(self, instance):
        # Flush the buffer once it has reached chunk_size instances.
        if self.chunk_size and len(self.instances) >= self.chunk_size:
            self.create()
            self.instances = []

        self.instances.append(instance)

    def create(self):
        self.model.objects.bulk_create(self.instances)


instances = BulkCreateManager(Model, 23000)
for host in hosts:
    instance = ...
    instances.append(instance)

instances.create()  # flush whatever is left in the buffer

Upvotes: 2
