Reputation:
I have a model like the following:
class Foo(models.Model):
    fruit = models.CharField(max_length=10)
    stuff = models.CharField(max_length=10)
    color = models.CharField(max_length=10)
    owner = models.CharField(max_length=20)
    exists = models.BooleanField()

    class Meta:
        unique_together = (('fruit', 'stuff', 'color'), )
It is populated with some data:
fruit  stuff  color  owner  exists
Apple  Table  Blue   abc    True
Pear   Book   Red    xyz    False
Pear   Phone  Green  xyz    False
Apple  Phone  Blue   abc    True
Pear   Table  Green  abc    True
I need to merge/join this with a collection (not a queryset):
[('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
So basically rows 0 and 2 should be returned when I search this model with this list of tuples.
Currently my workaround is to read Foo.objects.all() into a DataFrame, do a merge with the list of tuples, and pass the resulting IDs to Foo.objects.filter(). I also tried iterating over the list and calling Foo.objects.get() on each tuple, but it is very slow because the list is quite big.
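For reference, a minimal sketch of that workaround (assuming pandas; the column names mirror the model):

import pandas as pd

# Pull the table into a DataFrame, inner-merge with the incoming
# tuples, and collect the matching primary keys.
df = pd.DataFrame(list(Foo.objects.values('id', 'fruit', 'stuff', 'color')))
wanted = pd.DataFrame(
    [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')],
    columns=['fruit', 'stuff', 'color'],
)
ids = df.merge(wanted, on=['fruit', 'stuff', 'color'])['id']
matches = Foo.objects.filter(id__in=list(ids))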
When I tried chaining Q's as suggested by the current answers, it threw an OperationalError (too many SQL variables).
My main goal is the following:
As it can be seen from the model these three fields together form my primary key. The table contains around 15k entries. When I get data from another source I need to check if the data is already in my table and create/update/delete accordingly (new data may contain up to 15k entries). Is there a clean and efficient way to check if these records are already in my table?
Note: The list of tuples does not have to be in that shape. I can modify it, turn it into another data structure or transpose it.
Upvotes: 8
Views: 574
Reputation: 77291
This question is likely a manifestation of the X/Y problem. Instead of asking about your problem X you are asking about the solution Y you came up with.
Why are you keeping a counter field in the first place? I mean, why not remove the count field and query it with:
from django.db.models import Count

Foo.objects.order_by('fruit', 'stuff', 'color') \
    .values('fruit', 'stuff', 'color') \
    .annotate(count=Count('*'))
Or keep it but use the sum of count instead:
from django.db.models import Sum

Foo.objects.order_by('fruit', 'stuff', 'color') \
    .values('fruit', 'stuff', 'color') \
    .annotate(total=Sum('count'))
If you drop the unique_together constraint, all you have to do in order to merge the dataset is insert your new entries into the database:
for fruit, stuff, color in collection:
    Foo.objects.update_or_create(fruit=fruit, stuff=stuff, color=color)
Or, assuming collection is a dict mapping (fruit, stuff, color) tuples to counts:
from django.db.models import F

for (fruit, stuff, color), count in collection.items():
    # Pass the count via defaults so it is written, not used as a lookup;
    # F('count') keeps the addition inside the database.
    Foo.objects.update_or_create(
        fruit=fruit,
        stuff=stuff,
        color=color,
        defaults={'count': F('count') + count},
    )
Please don't answer "it is for performance reasons" unless you have profiled both approaches - in my not-so-humble opinion it is the database's job to keep the score. If you try it and really find a performance problem then a competent DBA will propose a solution (in rare cases it may involve keeping an auxiliary table with the count through the use of database triggers).
My point is, keeping a value that can be calculated by the database is a questionable design. You must have a good reason for it, and you must profile the 'let the database calculate it' approach first - otherwise you risk complicating your design because of imaginary performance reasons.
Anyway, I can't think of any strategy that makes this better than O(n), with n being the number of entries in the dataset you want to merge.
Then again, I may have guessed your original problem wrong, so let us know if that is the case.
Upvotes: 0
Reputation: 6865
You have ('fruit', 'stuff', 'color') unique together, so if your search tuple is ('Apple', 'Table', 'Blue') and we concatenate it, the result is also a unique string:
f = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
c = [''.join(w) for w in f]
# Output: ['AppleTableBlue', 'PearPhoneGreen']
So we can filter the queryset on an annotation and make use of Concat:
from django.db.models import CharField
from django.db.models.functions import Concat

Foo.objects.annotate(u_key=Concat('fruit', 'stuff', 'color', output_field=CharField())).filter(u_key__in=c)
# Output: <QuerySet [<Foo: #0 row>, <Foo: #2 row>]>
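One caveat: joining without a separator can collide, e.g. ('Apple', 'Tab', 'leBlue') produces the same key as ('Apple', 'Table', 'Blue'). A sketch using an explicit separator (the '|' is an arbitrary choice, assumed not to occur in the fields) avoids that:

from django.db.models import CharField, Value
from django.db.models.functions import Concat

sep = '|'
c = [sep.join(w) for w in f]
Foo.objects.annotate(
    u_key=Concat('fruit', Value(sep), 'stuff', Value(sep), 'color',
                 output_field=CharField())
).filter(u_key__in=c)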
This will work for both tuples and lists.
If the input is a list of two tuples:
[('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
after transposing, the input will be:
transpose_input = [('Apple', 'Pear'), ('Table', 'Phone'), ('Blue', 'Green')]
We can easily detect that the input is transposed by comparing each_tuple_size with input_list_size, then use zip to transpose it back, and the above solution will work as expected.
if each_tuple_size == 2 and input_list_size == 3:
    transpose_again = list(zip(*transpose_input))
    # use *transpose_again* variable further
If the input is a list of three tuples:
[('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green'), ('Pear', 'Book', 'Red')]
after transposing, the input will be:
transpose_input = [('Apple', 'Pear', 'Pear'), ('Table', 'Phone', 'Book'), ('Blue', 'Green', 'Red')]
So for any n*n matrix it is impossible to tell that the input is transposed, and the above shape check will fail.
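A quick illustration of the ambiguity (using the shape check from above):

rows = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green'), ('Pear', 'Book', 'Red')]
cols = list(zip(*rows))
len(rows), len(rows[0])  # (3, 3)
len(cols), len(cols[0])  # (3, 3) -- same shape, so transposition is undetectable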
Upvotes: 4
Reputation: 73470
If you know these fields constitute your natural key and you have to do heavy querying on them, add this natural key as a proper field and take measures to maintain it:
class FooQuerySet(models.QuerySet):
    def bulk_create(self, objs, batch_size=None):
        objs = list(objs)
        for obj in objs:
            obj.natural_key = Foo.get_natural_key(obj.fruit, obj.stuff, obj.color)
        return super(FooQuerySet, self).bulk_create(objs, batch_size=batch_size)

    # you might override update(...) with proper F and Value expressions,
    # but I assume the natural key does not change


class FooManager(models.Manager):
    def get_queryset(self):
        return FooQuerySet(self.model, using=self._db)


class Foo(models.Model):
    NK_SEP = '|||'  # sth unlikely to occur in the other fields

    fruit = models.CharField(max_length=10)
    stuff = models.CharField(max_length=10)
    color = models.CharField(max_length=10)
    natural_key = models.CharField(max_length=40, unique=True, db_index=True)

    @staticmethod
    def get_natural_key(*args):
        return Foo.NK_SEP.join(args)

    def save(self, *args, **kwargs):
        self.natural_key = Foo.get_natural_key(self.fruit, self.stuff, self.color)
        super(Foo, self).save(*args, **kwargs)

    objects = FooManager()

    class Meta:
        unique_together = (('fruit', 'stuff', 'color'), )
Now you can query:
from itertools import starmap

lst = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
existing_foos = Foo.objects.filter(
    natural_key__in=list(starmap(Foo.get_natural_key, lst))
)
And batch create:
# evaluate the existing keys once, not once per tuple
existing = set(existing_foos.values_list('fruit', 'stuff', 'color'))
Foo.objects.bulk_create(
    [
        Foo(fruit=x[0], stuff=x[1], color=x[2])
        for x in lst
        if x not in existing
    ]
)
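A quick sanity check of the natural key maintenance (the value follows from NK_SEP in the model above):

foo = Foo(fruit='Apple', stuff='Table', color='Blue')
foo.save()
foo.natural_key  # 'Apple|||Table|||Blue'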
Upvotes: 1
Reputation: 5720
This is the correct query:
from django.db.models import Q

q = Foo.objects.filter(
    Q(fruit='Apple', stuff='Table', color='Blue') |
    Q(fruit='Pear', stuff='Phone', color='Green')
)
This query will also work, if you don't like Q:
q = Foo.objects.filter(
fruit='Apple', stuff='Table', color='Blue'
) | Foo.objects.filter(
fruit='Pear', stuff='Phone', color='Green'
)
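Both forms should compile to the same OR'd WHERE clause; one way to inspect it (the exact SQL depends on your backend):

print(q.query)
# ... WHERE ((fruit = Apple AND stuff = Table AND color = Blue)
#            OR (fruit = Pear AND stuff = Phone AND color = Green))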
Upvotes: 2
Reputation: 15738
What you did with the Q objects is AND all the conditions together. What you want is to OR the Q objects, one per tuple, with the tuple's attributes set, as follows:
Foo.objects.filter(Q(fruit='Apple', stuff='Table', color='Blue') | Q(...))
To do this programmatically, you can do something like the following:
from functools import reduce
from django.db.models import Q

tuples = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
query = reduce(lambda q, v: q | Q(fruit=v[0], stuff=v[1], color=v[2]), tuples, Q())
Foo.objects.filter(query)
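Since the question mentions an OperationalError (too many SQL variables) for a large list, a hedged sketch that processes the tuples in chunks keeps each statement under the backend's parameter limit (the function name and the batch size of 500 are assumptions, not fixed values):

from functools import reduce
from itertools import islice
from django.db.models import Q

def filter_in_batches(tuples, batch_size=500):
    # Build one OR-of-Q query per chunk so no single statement
    # exceeds the database's bound-parameter limit.
    it = iter(tuples)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            break
        query = reduce(lambda q, v: q | Q(fruit=v[0], stuff=v[1], color=v[2]), chunk, Q())
        yield from Foo.objects.filter(query)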
Upvotes: 0