Reputation: 45
I am working on a web crawler and am trying to write a regex that does the following.
Match all pages starting with http://intranet/ but not starting with http://intranet/sites/ or http://intranet/search/, that are in the subfolder /Pages/ and end with .aspx.
Valid sample:
http://intranet/products/Pages/default.aspx
Invalid samples:
http://intranet/Pages/sofus/default.aspx
http://intranet/sites/products/Pages/default.aspx
http://intranet/products/Pages/default.aspx#
So far I have come up with this:
^http://intranet.*/Pages/.*.aspx+
Any help appreciated.
Upvotes: 3
Views: 278
Reputation: 149040
A pattern like this should work:
^http://intranet/(?!sites|search)[^/]+/Pages/.*\.aspx$
The (?!...) creates what's known as a negative lookahead assertion and ensures that the path segment matched by [^/]+ does not start with sites or search.
Here's a demonstration.
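If you want to check the pattern locally against the samples from the question, here is a minimal sketch in Python (my choice of language; the question doesn't say which regex engine the crawler uses, so lookahead support is an assumption). The names is_crawlable and is_crawlable_strict and the "searching" URL are made up for illustration. The second pattern is an optional tightening, not part of the pattern above: because the lookahead has no trailing slash, a folder such as "searching" is also rejected, and adding the slash limits the exclusion to exactly sites/ and search/.

```python
import re

# Pattern from the answer: one folder after the host that does not start
# with "sites" or "search", then /Pages/, then anything ending in ".aspx".
PATTERN = re.compile(r"^http://intranet/(?!sites|search)[^/]+/Pages/.*\.aspx$")

# Optional variant (an assumption about what is wanted): exclude only the
# exact folders "sites/" and "search/" by including the slash in the lookahead.
STRICT_PATTERN = re.compile(
    r"^http://intranet/(?!(?:sites|search)/)[^/]+/Pages/.*\.aspx$"
)


def is_crawlable(url: str) -> bool:
    """Return True if the URL matches the pattern from the answer."""
    return PATTERN.match(url) is not None


def is_crawlable_strict(url: str) -> bool:
    """Return True if the URL matches the variant with the trailing slash."""
    return STRICT_PATTERN.match(url) is not None


if __name__ == "__main__":
    samples = [
        "http://intranet/products/Pages/default.aspx",        # valid in the question
        "http://intranet/Pages/sofus/default.aspx",           # invalid in the question
        "http://intranet/sites/products/Pages/default.aspx",  # invalid in the question
        "http://intranet/products/Pages/default.aspx#",       # invalid in the question
        "http://intranet/searching/Pages/default.aspx",       # hypothetical edge case
    ]
    for url in samples:
        print(url, is_crawlable(url), is_crawlable_strict(url))
```

The two functions agree on the four samples from the question and differ only on the hypothetical "searching" URL, so which one you want depends on whether such folders should be crawled.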
Upvotes: 4