Reputation: 1147
I am trying to collect a set of URLs, using BeautifulSoup, with very specific criteria. The URLs I want to collect must contain /b-\d+ (/b- followed by a series of digits). However, I want to ignore all URLs containing View%20All, even if they also contain /b-\d+.
Here is a sample of URLs:
1. http://www.foo.com/bar/b-12312903?sName=View%20All
2. http://www.foo.com/bar/b-832173712873?sName=View%20All
3. http://www.foo.com/bar/b-1208313109283129
4. http://www.foo.com/bar/b-2198123371239489?adCell=W3
Given the above sample, the valid URLs that I want to collect are #3 and #4. I have tried using different negative lookahead regular expressions and they just aren't working for me:
{"href" : re.compile(r"\/b-\d+.+(?!View\%20All)")}
{"href" : re.compile(r"^.+\/b-\d+.+(?!View\%20All$)")}
Can someone tell me what I am doing wrong?
Upvotes: 4
Views: 3087
Reputation: 2282
^.*?/b-\d+(?:(?!View%20All).)*$
Or much faster
^.+?/b-\d+(?:[^V]+|V(?!iew%20All))*$
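A quick way to check both patterns is to run them over the four sample URLs from the question. This is only a sketch: the names urls, tempered, unrolled and keep are made up for illustration, and it uses search() because, as far as I know, that is how BeautifulSoup applies a compiled regex filter.

```python
import re

# Sample URLs from the question; only the last two should be kept.
urls = [
    "http://www.foo.com/bar/b-12312903?sName=View%20All",
    "http://www.foo.com/bar/b-832173712873?sName=View%20All",
    "http://www.foo.com/bar/b-1208313109283129",
    "http://www.foo.com/bar/b-2198123371239489?adCell=W3",
]

# Tempered dot: after /b-<digits>, a character is consumed only if
# "View%20All" does not start at that position.
tempered = re.compile(r"^.*?/b-\d+(?:(?!View%20All).)*$")

# Unrolled variant: consume runs of non-V characters in one step, and a
# single V only when it is not the start of "View%20All".
unrolled = re.compile(r"^.+?/b-\d+(?:[^V]+|V(?!iew%20All))*$")


def keep(pattern, candidates):
    """Return the candidates for which pattern.search() finds a match."""
    return [u for u in candidates if pattern.search(u)]


kept_tempered = keep(tempered, urls)
kept_unrolled = keep(unrolled, urls)
print(kept_tempered)
print(kept_unrolled)
```

Both lists should contain only URLs 3 and 4.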
Upvotes: 1
Reputation: 26667
{"href" : re.compile(r"\/b-\d+.+(?!View\%20All)")}
{"href" : re.compile(r"^.+\/b-\d+.+(?!View\%20All$)")}
Where did it go wrong?
When you write (?!View\%20All), it only asserts that View\%20All cannot be matched immediately after whatever the preceding pattern, .+, has consumed. Because .+ can consume any number of characters, the engine can always find a position where that holds, so in effect the lookahead is always true.
To illustrate, let's check what each part of the pattern matches in
http://www.foo.com/bar/b-12312903?sName=View%20All
/b- is obvious.
\d+ matches 12312903.
Now the problem arises: .+ matches anything that lets the negative assertion (?!View\%20All) succeed. Being greedy, it first consumes the rest of the string, ?sName=View%20All, and at the end of the string nothing follows, so the assertion succeeds.
Even if .+ matched only ?, the text left unmatched would be sName=View%20All, which does not start with View%20All at the position of the s, so the assertion would still succeed.
Hence the match always succeeds, and URLs 1 and 2 are matched as well.
See the demo to get a clearer picture.
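The same effect can be seen directly in Python. The sketch below takes the first pattern from the question and wraps .+ in a capturing group (the group is an addition for illustration only, as are the variable names) to show how much text .+ swallows before the lookahead is tested.

```python
import re

url = "http://www.foo.com/bar/b-12312903?sName=View%20All"

# First pattern from the question, with a capturing group added around .+
# so we can inspect what it consumed (the group is not in the original).
attempt = re.compile(r"\/b-\d+(.+)(?!View\%20All)")

m = attempt.search(url)
consumed = m.group(1)       # text taken by the greedy .+
remaining = url[m.end():]   # text left when the lookahead is checked

print(repr(consumed))
print(repr(remaining))
```

Here .+ takes everything up to the end of the URL, so nothing is left for the lookahead to reject.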
Fix:
When using lookaround assertions, pin down the position from which the check starts. For example, a regex like
(\/b-\d+)(\?|$)(?!sName=View\%20All)
matches only URLs 3 and 4, as shown at
http://regex101.com/r/aS5yS2/1
Here the \? or $ in the pattern fixes the position at which the negative assertion is checked.
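As a rough check, the pattern above can be run against the sample URLs from the question. The names urls, fixed, kept and other are hypothetical, and the extra URL in other is made up to show one limitation: the lookahead only inspects the text directly after the ?, so a View%20All that is not the first query parameter would slip through.

```python
import re

# Sample URLs from the question.
urls = [
    "http://www.foo.com/bar/b-12312903?sName=View%20All",
    "http://www.foo.com/bar/b-832173712873?sName=View%20All",
    "http://www.foo.com/bar/b-1208313109283129",
    "http://www.foo.com/bar/b-2198123371239489?adCell=W3",
]

# The ? or end-of-string pins the spot where the lookahead is evaluated.
fixed = re.compile(r"(\/b-\d+)(\?|$)(?!sName=View\%20All)")

kept = [u for u in urls if fixed.search(u)]
print(kept)

# Hypothetical URL (not from the question): sName is not the first
# parameter, so the lookahead does not see it and the URL still matches.
other = "http://www.foo.com/bar/b-1?page=2&sName=View%20All"
still_matched = fixed.search(other) is not None
print(still_matched)
```

The compiled pattern can then be passed as the href filter in the same way as in the question, i.e. {"href": fixed}.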
Upvotes: 4