Reputation: 793
In Django, is it possible to find duplicates with queryset and regex?
Django select only rows with duplicate field values shows without using a regex:
self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)
I have a model:
class company(models.Model):
Website = models.URLField(blank=True, null=True )
I want to find duplicates with regex
For example.
Company.objects.create(Website='http://example.com')
Company.objects.create(Website='http://www.example.com')
Both of these are the same website. I'd like to use a regex so that it will return return these companies as duplicates.
I know there is filters like that use regex. I'm not sure how to update this to use a regex:
self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)
I'd like to do something like:
Website__iregex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
Update There was some confusion so I'll give an example.
Here is what my db looks like
Company.objects.create(Website='http://example.com')
Company.objects.create(Website='http://www.example.com')
Company.objects.create(Website='http://example.org', Name='a')
Company.objects.create(Website='http://example.org', Name='b')
When I call
Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)
It returns:
This is missing that example.com and www.example.com are the same website.
I want to use a regex so that I can tell django that example.com and www.example.com are the same websites.
I want to modify:
Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)
so that it returns the duplicates:
http://example.org (from name=a) and http://example.org (from name=b)
example.com www.example.com
Upvotes: 1
Views: 2284
Reputation: 694
I have similar DB structure on some project - i store urls of some entities. To find duplicates i also store the 'domain' of url.
So regarding your examples the DB structure would be:
id | url | domain
-----------------------------------------
1 | http://www.example.com | example.com
2 | http://example.com | example.com
It's easy then to find duplicates or find urls/entities related to particular domain.
You may think that using such an approach is an overkill to detect duplicates.
But your approach has 2 big drawbacks:
1) it's impossible to write correct regexp that will match domains variations
Matching "www.example.com" and "example.com" is easy. How about "example.co.uk" and "www.example.co.uk" or maybe "www.старт.рф" and "старт.рф" ? Those are all valid domain names.
2) you are shooting your leg in the long run - writing complex regexps on ever growing DB tables will kill your performance.
p.s. - i use "tldextract" lib to get domain of urls.
Upvotes: 0
Reputation: 14190
Use __icontains:
Company.objects.filter(Website__icontains='example.com')
Which will produce:
`ILIKE %'example.com'%.
It will thus return the following records if exists in the Company table:
http://example.com, http://www.example.com
Upvotes: 1