ohrstrom
ohrstrom

Reputation: 2970

Django ORM filter queryset for duplicates

I'm looking for a way to get a queryset for possible duplicates in the database.

Simplified models:

class Artist(models.Model):
    name = models.CharField(max_length=255, db_index=True)

class Track(models.Model):
    name = models.CharField(max_length=255, db_index=True)
    artist = models.ForeignKey(related_name='tracks')

What is working so far is a query to get 'Tracks' with equal names:
(not so elegant, however the speed does not matter much as the query is only used in infrequent maintenance work):

qs = Track.objects.all()
duplicates = Track.objects.values('name')\
    .annotate(Count('id'))\
    .filter(id__count__gt=1)    
qs = qs.filter(name__in=[item['name'] for item in duplicates])

Any ideas how to extend this to get a queryset where Track.name and the related Artist.name are possible duplicates?

Upvotes: 0

Views: 357

Answers (1)

Krzysztof Szularz
Krzysztof Szularz

Reputation: 5249

Getting a single query set that spans across models that don't have a common inheritance parent is a bad smell. I'd stick to the simple list instead. Any QS logic doesn't translate to such an construct.

The below one is quite speedy. It makes use of the indexes. Please mind the empty .order_by(). If your models don't have ordering specified in the Meta class, you can skip it.

Get all Track names:

track_names = Track.objects.order_by().values_list('name', flat=True).distinct()

Get all Artist names:

artist_names = Artist.objects.order_by().values_list('name', flat=True).distinct()

Then get an intersection:

duplicated_names = set(track_names) & set(artist_names)

That way you get the names that are both in Artist and Track models.

To get them, simply:

Track.objects.filter(name__in=duplicated_names)

To find the duplicates in single model, use the method you've already quoted.

Upvotes: 1

Related Questions