Reputation: 12706
I need a way to recognize URLs with a similar pattern, e.g. a function which returns true when given a pair like
http://mysite.com/page/123
and
http://mysite.com/page/456
or
http://mysite.com/?page=123
and
http://mysite.com/?page=456
or
http://mysite.com/?page=123&param=2
and
http://mysite.com/?page=456&param=3
I don't need to check the validity of the URLs here, only find out whether the pattern is the same. I probably need a regular expression for this, but I can't figure out how to write it. Can anyone help? Thanks.
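One way to get the behavior described is to normalize each URL before comparing, replacing every run of digits with a fixed placeholder. This assumes the numeric segments are the only parts that vary between "same pattern" URLs, which the question doesn't pin down; a minimal sketch in Python:

```python
import re

def same_pattern(url_a, url_b):
    """Return True when the two URLs are identical once every
    run of digits is masked out with a placeholder."""
    def normalize(url):
        # Replace each run of digits with a fixed token so only
        # the surrounding structure is compared.
        return re.sub(r"\d+", "{n}", url)
    return normalize(url_a) == normalize(url_b)

print(same_pattern("http://mysite.com/page/123",
                   "http://mysite.com/page/456"))       # True
print(same_pattern("http://mysite.com/page/123",
                   "http://mysite.com/?page=123"))      # False
```

This treats the URL as an opaque string, so it will not consider reordered querystring parameters equivalent; the section-by-section approaches below are more robust for that case.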
Upvotes: 0
Views: 251
Reputation: 8664
Not a specific answer, but I feel that if you want this to work well in a generalised sense, you will need to be content-aware, i.e. you need to break each URL into subsections:
... And process each separately. The level of acceptable fuzziness will control how much you need to break up the URL, but each section would (I feel) need quite specific inspection.

The protocol and domain could be straight string matches, but the paths could perhaps be split by '/' and then, after basic length checks, the elements could be compared one by one, only comparing items of equal depth (using direct equality or a "change distance" like the Levenshtein distance mentioned earlier).

The querystrings could be broken up into dictionaries via a simple split on "&" then by "=", which you could sort and compare however you want. This would also satisfy @MarcGravell's question about reordered querystring parameters.
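A minimal Python sketch of that section-by-section idea, using the standard library's `urlsplit` and `parse_qs` (the finer per-segment comparison, e.g. Levenshtein on individual path elements, is left out for brevity):

```python
from urllib.parse import urlsplit, parse_qs

def same_structure(url_a, url_b):
    """Compare two URLs section by section: scheme and host must
    match exactly, the paths must have the same number of
    '/'-separated segments, and the querystrings must use the same
    parameter names (order-independent)."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    # Protocol and domain: straight string matches.
    if (a.scheme, a.netloc) != (b.scheme, b.netloc):
        return False
    # Paths: basic length check on the split segments.
    if len(a.path.strip("/").split("/")) != len(b.path.strip("/").split("/")):
        return False
    # Querystrings as dictionaries, compared by sorted key names.
    return sorted(parse_qs(a.query)) == sorted(parse_qs(b.query))
```

Because the query parameters are compared as a sorted set of names, `?page=123&param=2` and `?param=3&page=456` come out structurally equal, which covers the reordered-parameter case mentioned above.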
Upvotes: 2
Reputation: 4857
Maybe you can try the Levenshtein distance (http://www.dotnetperls.com/levenshtein), which is used to measure the similarity between strings.
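A small sketch of that idea in Python: the classic dynamic-programming edit distance, turned into a similarity ratio against the longer string. The 0.85 threshold is an arbitrary choice, not anything the linked page prescribes:

```python
def levenshtein(s, t):
    """Edit distance: the number of single-character insertions,
    deletions, and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

def similar(url_a, url_b, threshold=0.85):
    """Treat the URLs as similar when the edit distance is small
    relative to the longer string (threshold is arbitrary)."""
    longest = max(len(url_a), len(url_b)) or 1
    return 1 - levenshtein(url_a, url_b) / longest >= threshold
```

Note that this measures overall textual closeness, so two URLs with genuinely different structure but mostly shared characters can still score high; it pairs well with the section-by-section splitting suggested above.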
Upvotes: 3
Reputation: 21690
Use a longest common subsequence algorithm and divide its length by the length of either of the strings. If the ratio is above an arbitrary threshold, they're similar enough.
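A Python sketch of that approach, using the standard dynamic-programming recurrence for the longest common subsequence. Dividing by the longer of the two strings and the 0.85 cut-off are both arbitrary choices, as the answer says:

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t,
    computed with a rolling one-row DP table."""
    prev = [0] * (len(t) + 1)
    for cs in s:
        cur = [0]
        for j, ct in enumerate(t, 1):
            # Extend the subsequence on a match, otherwise carry
            # forward the best of the two neighbouring cells.
            cur.append(prev[j - 1] + 1 if cs == ct
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similar(url_a, url_b, threshold=0.85):
    """Similarity ratio: shared subsequence length over the length
    of the longer URL, against an arbitrary threshold."""
    return lcs_length(url_a, url_b) / max(len(url_a), len(url_b)) >= threshold
```

Like the Levenshtein approach, this scores raw textual overlap rather than URL structure, so it is best combined with splitting the URL into protocol, path, and querystring first.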
Upvotes: 2