Reputation: 12706
I need a way to recognize URLs with a similar pattern, e.g. a function which returns true when given a pair like
http://mysite.com/page/123
and
http://mysite.com/page/456
or
http://mysite.com/?page=123
and
http://mysite.com/?page=456
or
http://mysite.com/?page=123&param=2
and
http://mysite.com/?page=456&param=3
I don't need to check the validity of the URLs here, only find out whether the pattern is the same. I probably need a regular expression for this, but I can't figure out how to write it. Can anyone help? Thanks.
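One way to get the behavior described is to normalize each URL before comparing, replacing every run of digits with a fixed placeholder. This assumes the numeric segments are the only parts that vary between "same pattern" URLs, which the question doesn't pin down; a minimal sketch in Python:

```python
import re

def same_pattern(url_a, url_b):
    """Return True when the two URLs are identical once every
    run of digits is masked out with a placeholder."""
    def normalize(url):
        # Replace each run of digits with a fixed token so only
        # the surrounding structure is compared.
        return re.sub(r"\d+", "{n}", url)
    return normalize(url_a) == normalize(url_b)

print(same_pattern("http://mysite.com/page/123",
                   "http://mysite.com/page/456"))       # True
print(same_pattern("http://mysite.com/page/123",
                   "http://mysite.com/?page=123"))      # False
```

This treats the URL as an opaque string, so it will not consider reordered querystring parameters equivalent; the section-by-section approaches below are more robust for that case.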
Upvotes: 0
Views: 251
Reputation: 8664
Not a specific answer, but I feel that if you want this to work well in a generalised sense, you will need to be content-aware, i.e. you need to break each URL into subsections:
... And process each separately. The level of acceptable fuzziness will control how much you need to break up the URL, but each section would (I feel) need quite specific inspection.

The protocol and domain could be straight string matches, but the paths could perhaps be split by '/' and then, after basic length checks, the elements could be compared one by one, only comparing items of equal depth (using direct equality or a "change distance" like the Levenshtein distance mentioned earlier).

The querystrings could be broken up into dictionaries via a simple split on "&" then by "=", which you could sort and compare however you want. This would also satisfy @MarcGravell's question about reordered querystring parameters.
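A minimal Python sketch of that section-by-section idea, using the standard library's `urlsplit` and `parse_qs` (the finer per-segment comparison, e.g. Levenshtein on individual path elements, is left out for brevity):

```python
from urllib.parse import urlsplit, parse_qs

def same_structure(url_a, url_b):
    """Compare two URLs section by section: scheme and host must
    match exactly, the paths must have the same number of
    '/'-separated segments, and the querystrings must use the same
    parameter names (order-independent)."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    # Protocol and domain: straight string matches.
    if (a.scheme, a.netloc) != (b.scheme, b.netloc):
        return False
    # Paths: basic length check on the split segments.
    if len(a.path.strip("/").split("/")) != len(b.path.strip("/").split("/")):
        return False
    # Querystrings as dictionaries, compared by sorted key names.
    return sorted(parse_qs(a.query)) == sorted(parse_qs(b.query))
```

Because the query parameters are compared as a sorted set of names, `?page=123&param=2` and `?param=3&page=456` come out structurally equal, which covers the reordered-parameter case mentioned above.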
Upvotes: 2
Reputation: 4857
Maybe you can try the Levenshtein distance (http://www.dotnetperls.com/levenshtein), which is used to measure the similarity between strings.
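A small sketch of that idea in Python: the classic dynamic-programming edit distance, turned into a similarity ratio against the longer string. The 0.85 threshold is an arbitrary choice, not anything the linked page prescribes:

```python
def levenshtein(s, t):
    """Edit distance: the number of single-character insertions,
    deletions, and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

def similar(url_a, url_b, threshold=0.85):
    """Treat the URLs as similar when the edit distance is small
    relative to the longer string (threshold is arbitrary)."""
    longest = max(len(url_a), len(url_b)) or 1
    return 1 - levenshtein(url_a, url_b) / longest >= threshold
```

Note that this measures overall textual closeness, so two URLs with genuinely different structure but mostly shared characters can still score high; it pairs well with the section-by-section splitting suggested above.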
Upvotes: 3
Reputation: 21690
Use a longest common subsequence algorithm and divide its length by the length of either of the strings. If the ratio is above an arbitrary threshold, they're similar enough.
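A Python sketch of that approach, using the standard dynamic-programming recurrence for the longest common subsequence. Dividing by the longer of the two strings and the 0.85 cut-off are both arbitrary choices, as the answer says:

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t,
    computed with a rolling one-row DP table."""
    prev = [0] * (len(t) + 1)
    for cs in s:
        cur = [0]
        for j, ct in enumerate(t, 1):
            # Extend the subsequence on a match, otherwise carry
            # forward the best of the two neighbouring cells.
            cur.append(prev[j - 1] + 1 if cs == ct
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similar(url_a, url_b, threshold=0.85):
    """Similarity ratio: shared subsequence length over the length
    of the longer URL, against an arbitrary threshold."""
    return lcs_length(url_a, url_b) / max(len(url_a), len(url_b)) >= threshold
```

Like the Levenshtein approach, this scores raw textual overlap rather than URL structure, so it is best combined with splitting the URL into protocol, path, and querystring first.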
Upvotes: 2