Frank Liao

Reputation: 965

How can I check for duplicates, both within the objects being created and against existing rows, when using Django's bulk_create?

I have a lot of data, and it is pretty dirty. For example:

The ORM model for table A:

id = models.CharField(default='', max_length=50)
time = models.DateTimeField(default=timezone.now)
number = models.CharField(default='', max_length=20)
value = models.CharField(default='', max_length=20)

unique_together = ['id', 'time', 'number']

Existing data in table A:

id   time                   number   value
 1     2018-07-16 00:00:00   1         64
 1     2018-07-16 00:00:00   2         -99
 1     2018-07-16 00:00:00   3         655
 1     2018-07-16 00:00:00   4         3
 2     2018-07-16 00:00:00   0         12

Data to import (sample):

id   time                   number   value
 1     2018-07-16 00:00:00   1         64
 3     2018-07-16 00:00:00   0         -99
 3     2018-07-16 00:00:00   0         11
 4     2018-07-16 00:00:00   0         -99
 4     2018-07-16 00:00:00   1         -99

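So when I do this:

objs = []  # create the list once, before the loop
for kwargs in rows:  # rows: the parsed import data (hypothetical name)
    objs.append(A(**kwargs))
A.objects.bulk_create(objs, batch_size=50000)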

This raises two kinds of duplicate errors:

  1. Table A already contains the row "1 2018-07-16 00:00:00 1".
  2. The import data contains "3 2018-07-16 00:00:00 0" twice within objs, so bulk_create raises a duplicate-key error and the whole transaction is rolled back.

I can solve case 1 with get_or_create, but for case 2 I have no efficient way to check whether the row I am about to append already exists in objs. I tried the function below, but once the data exceeds 1,000,000 rows the complexity becomes terrible, because every check scans the whole list.

def search(id, time, number, objs):
    for obj in objs:
        if obj['id'] == id and obj['time'] == time and obj['number'] == number:
            return True
    return False

Is there a better way? Thanks.

Upvotes: 0

Views: 781

Answers (1)

Daniel Hepper

Reputation: 29967

You can add a tuple with id, time and number to a set:

objs = []
duplicate_check = set()  # (id, time, number) keys already appended
for kwargs in rows:  # rows: the parsed import data (hypothetical name)
    data = kwargs['id'], kwargs['time'], kwargs['number']
    if data not in duplicate_check:
        objs.append(A(**kwargs))
        duplicate_check.add(data)
A.objects.bulk_create(objs, batch_size=50000)

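Set membership tests and insertions are O(1) on average, so checking every row costs O(n) overall, instead of the O(n²) of scanning the list for each row.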
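To try the idea without a database, here is a minimal sketch that works on plain dicts instead of model instances. The names dedupe_rows, KEY_FIELDS and existing_keys are hypothetical and not part of Django; seeding the set with keys that are already in the table is an assumption about how the first kind of duplicate could be handled in the same pass.

```python
KEY_FIELDS = ("id", "time", "number")  # mirrors unique_together in the question


def dedupe_rows(rows, existing_keys=()):
    """Return rows whose key is neither in existing_keys nor seen earlier.

    The first occurrence of each key wins; later duplicates are dropped.
    """
    seen = set(existing_keys)  # keys assumed to be in the table already
    unique = []
    for row in rows:
        key = tuple(row[field] for field in KEY_FIELDS)
        if key not in seen:  # average O(1) membership test
            seen.add(key)
            unique.append(row)
    return unique


# Sample data from the question, with every value kept as a string.
T = "2018-07-16 00:00:00"
existing = {("1", T, "1")}  # one key that table A already holds
rows = [
    {"id": "1", "time": T, "number": "1", "value": "64"},
    {"id": "3", "time": T, "number": "0", "value": "-99"},
    {"id": "3", "time": T, "number": "0", "value": "11"},
    {"id": "4", "time": T, "number": "0", "value": "-99"},
    {"id": "4", "time": T, "number": "1", "value": "-99"},
]
unique_rows = dedupe_rows(rows, existing)
print(len(unique_rows))  # 3 rows survive
```

In a Django project, existing_keys could probably be filled from A.objects.values_list('id', 'time', 'number'), provided the key values have the same types as the import rows (for example datetime objects rather than strings). On Django 2.2 or later, bulk_create(objs, ignore_conflicts=True) is another option that leaves conflict handling to the database.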

Upvotes: 2
