Reputation: 965
I have a lot of data, and it is pretty dirty. For example:
Table A (ORM model):
id = models.CharField(default='', max_length=50)
time = models.DateTimeField(default=timezone.now)
number = models.CharField(default='', max_length=20)
value = models.CharField(default='', max_length=20)
unique_together = ['id', 'time', 'number']
Data in table A:
id  time                 number  value
1   2018-07-16 00:00:00  1       64
1   2018-07-16 00:00:00  2       -99
1   2018-07-16 00:00:00  3       655
1   2018-07-16 00:00:00  4       3
2   2018-07-16 00:00:00  0       12
Import data (sample):
id  time                 number  value
1   2018-07-16 00:00:00  1       64
3   2018-07-16 00:00:00  0       -99
3   2018-07-16 00:00:00  0       11
4   2018-07-16 00:00:00  0       -99
4   2018-07-16 00:00:00  1       -99
So when I do:
objs = []
for kwargs in rows:  # rows is a placeholder name for the parsed import data
    objs.append(A(**kwargs))
A.objects.bulk_create(objs, batch_size=50000)
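It raises two kinds of duplicate errors:
1. The row already exists in the table. I can use get_or_create to solve this.
2. The same key appears more than once in the import data. Here I can't tell whether the row I'm about to append already exists in objs.
I tried the function below to check for existence, but once the data exceeds 1,000,000 rows the complexity becomes terrible, because every new row scans the whole list.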
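def search(id, time, number, objs):
    # Linear scan: O(len(objs)) for every row that is appended.
    # objs holds A instances, so fields are read as attributes.
    for obj in objs:
        if obj.id == id and obj.time == time and obj.number == number:
            return True
    return False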
Is there a better way? Thanks.
Upvotes: 0
Views: 781
Reputation: 29967
You can add a tuple of id, time and number to a set:
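objs = []
duplicate_check = set()
for kwargs in rows:  # rows is a placeholder name for the parsed import data
    data = kwargs['id'], kwargs['time'], kwargs['number']
    if data not in duplicate_check:
        # First time this key is seen: keep the row and remember the key.
        objs.append(A(**kwargs))
        duplicate_check.add(data)
A.objects.bulk_create(objs, batch_size=50000)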
Membership tests and insertions on a set take O(1) time on average, so the check stays cheap even with millions of rows.
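To see the idea without a database, here is a minimal sketch that runs on plain dicts using the sample rows from the question. The helper name dedupe_rows, its parameters, and the existing_keys set are made up for illustration. Seeding the set with keys that are already stored is an assumption about how the first kind of duplicate might be handled in the same pass, not something the code above does.

```python
def dedupe_rows(rows, key_fields=("id", "time", "number"), seen=None):
    """Return the rows whose key has not been seen yet (first occurrence wins).

    `seen` may be pre-filled with keys that already exist in the table.
    """
    # Work on a copy so the caller's set is left untouched.
    seen = set() if seen is None else set(seen)
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:  # hash lookup, O(1) on average
            seen.add(key)
            unique.append(row)
    return unique


T = "2018-07-16 00:00:00"

# Keys already stored in table A (taken from the question's sample data).
existing_keys = {
    ("1", T, "1"),
    ("1", T, "2"),
    ("1", T, "3"),
    ("1", T, "4"),
    ("2", T, "0"),
}

# The import sample from the question, one dict per row.
import_rows = [
    {"id": "1", "time": T, "number": "1", "value": "64"},
    {"id": "3", "time": T, "number": "0", "value": "-99"},
    {"id": "3", "time": T, "number": "0", "value": "11"},
    {"id": "4", "time": T, "number": "0", "value": "-99"},
    {"id": "4", "time": T, "number": "1", "value": "-99"},
]

new_rows = dedupe_rows(import_rows, seen=existing_keys)
for row in new_rows:
    print(row["id"], row["time"], row["number"], row["value"])
```

In a Django project, the existing keys could probably be loaded with A.objects.values_list('id', 'time', 'number') and the surviving rows turned into A(**row) before bulk_create. Keep in mind that the key values must have the same type on both sides (for example datetime objects versus strings for time), otherwise equal keys will not match.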
Upvotes: 2