Reputation:
I have a model like the following:
class Foo(models.Model):
    fruit = models.CharField(max_length=10)
    stuff = models.CharField(max_length=10)
    color = models.CharField(max_length=10)
    owner = models.CharField(max_length=20)
    exists = models.BooleanField()

    class Meta:
        unique_together = (('fruit', 'stuff', 'color'), )
It is populated with some data:
fruit  stuff  color  owner  exists
Apple  Table  Blue   abc    True
Pear   Book   Red    xyz    False
Pear   Phone  Green  xyz    False
Apple  Phone  Blue   abc    True
Pear   Table  Green  abc    True
I need to merge/join this with a collection (not a queryset):
[('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
So basically rows 0 and 2 should be returned when I search this model with this list of tuples.
Currently my workaround is to read Foo.objects.all() into a DataFrame, do a merge with the list of tuples, and pass the resulting IDs to Foo.objects.filter(). I also tried iterating over the list and calling Foo.objects.get() on each tuple, but it is very slow because the list is quite big.
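For reference, a minimal sketch of that workaround (assuming pandas; the column names mirror the model):

import pandas as pd

# Pull the table into a DataFrame, inner-merge with the incoming
# tuples, and collect the matching primary keys.
df = pd.DataFrame(list(Foo.objects.values('id', 'fruit', 'stuff', 'color')))
wanted = pd.DataFrame(
    [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')],
    columns=['fruit', 'stuff', 'color'],
)
ids = df.merge(wanted, on=['fruit', 'stuff', 'color'])['id']
matches = Foo.objects.filter(id__in=list(ids))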
When I tried chaining Q's as suggested by the current answers, it threw an OperationalError (too many SQL variables).
My main goal is the following:
As it can be seen from the model these three fields together form my primary key. The table contains around 15k entries. When I get data from another source I need to check if the data is already in my table and create/update/delete accordingly (new data may contain up to 15k entries). Is there a clean and efficient way to check if these records are already in my table?
Note: The list of tuples does not have to be in that shape. I can modify it, turn it into another data structure or transpose it.
Upvotes: 8
Views: 574
Reputation: 77291
This question is likely a manifestation of the X/Y problem. Instead of asking about your problem X you are asking about the solution Y you came up with.
Why are you keeping a counter field in the first place? I mean, why not remove the count field and query it with:
from django.db.models import Count

Foo.objects.order_by('fruit', 'stuff', 'color') \
    .values('fruit', 'stuff', 'color') \
    .annotate(count=Count('*'))
Or keep it but use the sum of count instead:
from django.db.models import Sum

Foo.objects.order_by('fruit', 'stuff', 'color') \
    .values('fruit', 'stuff', 'color') \
    .annotate(total=Sum('count'))
If you drop the unique_together constraint, all you have to do in order to merge the dataset is insert your new entries into the database:
for fruit, stuff, color in collection:
    Foo.objects.update_or_create(fruit=fruit, stuff=stuff, color=color)
Or, assuming collection is a dict mapping (fruit, stuff, color) tuples to counts:
from django.db.models import F

for (fruit, stuff, color), count in collection.items():
    # Pass the count via defaults so it is written, not used as a lookup;
    # F('count') keeps the addition inside the database.
    Foo.objects.update_or_create(
        fruit=fruit,
        stuff=stuff,
        color=color,
        defaults={'count': F('count') + count},
    )
Please don't answer "it is for performance reasons" unless you have profiled both approaches - in my not-so-humble opinion it is the database's job to keep the score. If you try it and really find a performance problem then a competent DBA will propose a solution (in rare cases it may involve keeping an auxiliary table with the count through the use of database triggers).
My point is, keeping a value that can be calculated by the database is a questionable design. You must have a good reason for it, and you must profile the 'let the database calculate it' approach first - otherwise you risk complicating your design because of imaginary performance reasons.
Anyway, I can't think of any strategy that makes this better than O(n), with n being the number of entries in the dataset you want to merge.
Then again, I may have guessed your original problem wrong, so let us know if that is the case.
Upvotes: 0
Reputation: 6865
You have ('fruit', 'stuff', 'color') unique together, so if your search tuple is ('Apple', 'Table', 'Blue') and we concatenate it, the result is also a unique string:
f = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
c = [''.join(w) for w in f]
# Output: ['AppleTableBlue', 'PearPhoneGreen']
So we can filter the queryset on an annotation and make use of Concat:
from django.db.models import CharField
from django.db.models.functions import Concat

Foo.objects.annotate(u_key=Concat('fruit', 'stuff', 'color', output_field=CharField())).filter(u_key__in=c)
# Output: <QuerySet [<Foo: #0 row>, <Foo: #2 row>]>
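One caveat: joining without a separator can collide, e.g. ('Apple', 'Tab', 'leBlue') produces the same key as ('Apple', 'Table', 'Blue'). A sketch using an explicit separator (the '|' is an arbitrary choice, assumed not to occur in the fields) avoids that:

from django.db.models import CharField, Value
from django.db.models.functions import Concat

sep = '|'
c = [sep.join(w) for w in f]
Foo.objects.annotate(
    u_key=Concat('fruit', Value(sep), 'stuff', Value(sep), 'color',
                 output_field=CharField())
).filter(u_key__in=c)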
This will work for both tuples and lists.
If the input is a list of two tuples:
[('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
after transposing, the input will be:
transpose_input = [('Apple', 'Pear'), ('Table', 'Phone'), ('Blue', 'Green')]
We can easily detect that the input is transposed by comparing each_tuple_size with input_list_size, then use zip to transpose it back, and the above solution will work as expected.
if each_tuple_size == 2 and input_list_size == 3:
    transpose_again = list(zip(*transpose_input))
    # use *transpose_again* variable further
If the input is a list of three tuples:
[('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green'), ('Pear', 'Book', 'Red')]
after transposing, the input will be:
transpose_input = [('Apple', 'Pear', 'Pear'), ('Table', 'Phone', 'Book'), ('Blue', 'Green', 'Red')]
So for any n*n matrix it is impossible to tell that the input is transposed, and the above shape check will fail.
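A quick illustration of the ambiguity (using the shape check from above):

rows = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green'), ('Pear', 'Book', 'Red')]
cols = list(zip(*rows))
len(rows), len(rows[0])  # (3, 3)
len(cols), len(cols[0])  # (3, 3) -- same shape, so transposition is undetectable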
Upvotes: 4
Reputation: 73470
If you know these fields constitute your natural key and you have to do heavy querying on them, add this natural key as a proper field and take measures to maintain it:
class FooQuerySet(models.QuerySet):
    def bulk_create(self, objs, batch_size=None):
        objs = list(objs)
        for obj in objs:
            obj.natural_key = Foo.get_natural_key(obj.fruit, obj.stuff, obj.color)
        return super(FooQuerySet, self).bulk_create(objs, batch_size=batch_size)

    # you might override update(...) with proper F and Value expressions,
    # but I assume the natural key does not change


class FooManager(models.Manager):
    def get_queryset(self):
        return FooQuerySet(self.model, using=self._db)


class Foo(models.Model):
    NK_SEP = '|||'  # sth unlikely to occur in the other fields

    fruit = models.CharField(max_length=10)
    stuff = models.CharField(max_length=10)
    color = models.CharField(max_length=10)
    natural_key = models.CharField(max_length=40, unique=True, db_index=True)

    @staticmethod
    def get_natural_key(*args):
        return Foo.NK_SEP.join(args)

    def save(self, *args, **kwargs):
        self.natural_key = Foo.get_natural_key(self.fruit, self.stuff, self.color)
        super(Foo, self).save(*args, **kwargs)

    objects = FooManager()

    class Meta:
        unique_together = (('fruit', 'stuff', 'color'), )
Now you can query:
from itertools import starmap

lst = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
existing_foos = Foo.objects.filter(
    natural_key__in=list(starmap(Foo.get_natural_key, lst))
)
And batch create:
# evaluate the existing keys once, not once per tuple
existing = set(existing_foos.values_list('fruit', 'stuff', 'color'))
Foo.objects.bulk_create(
    [
        Foo(fruit=x[0], stuff=x[1], color=x[2])
        for x in lst
        if x not in existing
    ]
)
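A quick sanity check of the natural key maintenance (the value follows from NK_SEP in the model above):

foo = Foo(fruit='Apple', stuff='Table', color='Blue')
foo.save()
foo.natural_key  # 'Apple|||Table|||Blue'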
Upvotes: 1
Reputation: 5720
This is the correct query:
from django.db.models import Q

q = Foo.objects.filter(
    Q(fruit='Apple', stuff='Table', color='Blue') |
    Q(fruit='Pear', stuff='Phone', color='Green')
)
This query will also work, if you don't like Q:
q = Foo.objects.filter(
fruit='Apple', stuff='Table', color='Blue'
) | Foo.objects.filter(
fruit='Pear', stuff='Phone', color='Green'
)
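Both forms should compile to the same OR'd WHERE clause; one way to inspect it (the exact SQL depends on your backend):

print(q.query)
# ... WHERE ((fruit = Apple AND stuff = Table AND color = Blue)
#            OR (fruit = Pear AND stuff = Phone AND color = Green))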
Upvotes: 2
Reputation: 15738
What you did with the Q objects is AND all the conditions together. What you want is to OR the Q objects, one per tuple, with the tuple's attributes set, as follows:
Foo.objects.filter(Q(fruit='Apple', stuff='Table', color='Blue') | Q(...))
To do this programmatically, you can do something like the following:
from functools import reduce
from django.db.models import Q

tuples = [('Apple', 'Table', 'Blue'), ('Pear', 'Phone', 'Green')]
query = reduce(lambda q, v: q | Q(fruit=v[0], stuff=v[1], color=v[2]), tuples, Q())
Foo.objects.filter(query)
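Since the question mentions an OperationalError (too many SQL variables) for a large list, a hedged sketch that processes the tuples in chunks keeps each statement under the backend's parameter limit (the function name and the batch size of 500 are assumptions, not fixed values):

from functools import reduce
from itertools import islice
from django.db.models import Q

def filter_in_batches(tuples, batch_size=500):
    # Build one OR-of-Q query per chunk so no single statement
    # exceeds the database's bound-parameter limit.
    it = iter(tuples)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            break
        query = reduce(lambda q, v: q | Q(fruit=v[0], stuff=v[1], color=v[2]), chunk, Q())
        yield from Foo.objects.filter(query)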
Upvotes: 0