brian
brian

Reputation: 793

Django find duplicates with queryset and regex

In Django, is it possible to find duplicates with queryset and regex?

Django select only rows with duplicate field values shows without using a regex:

self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

I have a model:

class company(models.Model):
   Website = models.URLField(blank=True, null=True )

I want to find duplicates with regex

For example.

Company.objects.create(Website='http://example.com')
Company.objects.create(Website='http://www.example.com')

Both of these are the same website. I'd like to use a regex so that it will return return these companies as duplicates.

I know there is filters like that use regex. I'm not sure how to update this to use a regex:

self.values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

I'd like to do something like:

Website__iregex='http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

Update There was some confusion so I'll give an example.

Here is what my db looks like

Company.objects.create(Website='http://example.com')
Company.objects.create(Website='http://www.example.com')
Company.objects.create(Website='http://example.org', Name='a')
Company.objects.create(Website='http://example.org', Name='b')

When I call

Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

It returns:

  1. http://example.org (from name=a) and http://example.org (from name=b)

This is missing that example.com and www.example.com are the same website.

I want to use a regex so that I can tell django that example.com and www.example.com are the same websites.

I want to modify:

Company.objects.all().values('Website').annotate(count=Count('id')).order_by().filter(count__gt=1)

so that it returns the duplicates:

  1. http://example.org (from name=a) and http://example.org (from name=b)

  2. example.com www.example.com

Upvotes: 1

Views: 2284

Answers (2)

Termos
Termos

Reputation: 694

I have similar DB structure on some project - i store urls of some entities. To find duplicates i also store the 'domain' of url.

So regarding your examples the DB structure would be:

id |           url          | domain
-----------------------------------------
1  | http://www.example.com | example.com
2  | http://example.com     | example.com

It's easy then to find duplicates or find urls/entities related to particular domain.

You may think that using such an approach is an overkill to detect duplicates.

But your approach has 2 big drawbacks:

1) it's impossible to write correct regexp that will match domains variations

Matching "www.example.com" and "example.com" is easy. How about "example.co.uk" and "www.example.co.uk" or maybe "www.старт.рф" and "старт.рф" ? Those are all valid domain names.

2) you are shooting your leg in the long run - writing complex regexps on ever growing DB tables will kill your performance.

p.s. - i use "tldextract" lib to get domain of urls.

Upvotes: 0

dan-klasson
dan-klasson

Reputation: 14190

Use __icontains:

Company.objects.filter(Website__icontains='example.com')

Which will produce:

`ILIKE %'example.com'%. 

It will thus return the following records if exists in the Company table:

 http://example.com, http://www.example.com

Upvotes: 1

Related Questions