Reputation: 2970
I'm looking for a way to get a queryset for possible duplicates in the database.
Simplified models:
class Artist(models.Model):
name = models.CharField(max_length=255, db_index=True)
class Track(models.Model):
name = models.CharField(max_length=255, db_index=True)
artist = models.ForeignKey(related_name='tracks')
What is working so far is a query to get 'Tracks' with equal names:
(not so elegant, however the speed does not matter much as the query is only used in infrequent maintenance work):
qs = Track.objects.all()
duplicates = Track.objects.values('name')\
.annotate(Count('id'))\
.filter(id__count__gt=1)
qs = qs.filter(name__in=[item['name'] for item in duplicates])
Any ideas how to extend this to get a queryset where Track.name and the related Artist.name are possible duplicates?
Upvotes: 0
Views: 357
Reputation: 5249
Getting a single query set that spans across models that don't have a common inheritance parent is a bad smell. I'd stick to the simple list instead. Any QS logic doesn't translate to such an construct.
The below one is quite speedy. It makes use of the indexes. Please mind the empty .order_by()
. If your models don't have ordering specified in the Meta
class, you can skip it.
Get all Track
names:
track_names = Track.objects.order_by().values_list('name', flat=True).distinct()
Get all Artist
names:
artist_names = Artist.objects.order_by().values_list('name', flat=True).distinct()
Then get an intersection:
duplicated_names = set(track_names) & set(artist_names)
That way you get the names that are both in Artist
and Track
models.
To get them, simply:
Track.objects.filter(name__in=duplicated_names)
To find the duplicates in single model, use the method you've already quoted.
Upvotes: 1