Reputation: 160
So I'm trying to get more familiar with Python web scraping, and for one specific function I want to find external links only. In the book I'm reading, the author implements this by simply removing the "http://" from the site's address and then checking whether each link contains the resulting string (the domain name without the preceding "http://").
I can see how this code might fail, and although I could simply write an if statement, it does make me wonder: is there any way to match all links that start with "http" but not with "http(s)://domain.com"? I tried many different regex solutions that I thought would work, but none of them have. Here are two examples, where the variable "site" contains the link address:
re.compile("^((?!"+site+").)^http|www*$")
re.compile("^http|www((?!"+site+").)*$")
The results I get are simply all links that start with http or www, and that's not what I intend. Again, I can implement this just fine with an if statement and filter the results, so this isn't a complete blocker, but I'm curious whether such a pattern exists.
Any help would be appreciated. I looked around the web but couldn't find anything that matches my use case.
Upvotes: 2
Views: 73
Reputation: 841
To match a string that starts with one string but not with another, you should use this pattern:
^(?!stringyoudontwant)stringyouwant.*
So in your case, this would be:
^(?!https?:\/\/domain\.com)http.*
For this kind of thing, you can check out https://regex101.com, which is a great interface for experimenting with complicated regexes.
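As a rough sketch of how this pattern could be built from a variable in Python: the helper name external_link_pattern, the sample links, and the extra boundary check after the domain are all assumptions added for illustration, not part of the pattern above. re.escape keeps the dot in the domain from matching any character.

```python
import re

def external_link_pattern(domain):
    # Hypothetical helper: `domain` is a bare host such as "domain.com".
    # The negative lookahead rejects links that start with http(s)://<domain>.
    # The (?:[/:?#]|$) boundary is an added assumption so that a host like
    # "domain.com.evil.net" is not mistaken for the site itself.
    return re.compile(
        r"^(?!https?://" + re.escape(domain) + r"(?:[/:?#]|$))http"
    )

# Sample links, made up for this example.
links = [
    "http://domain.com/about",
    "https://domain.com",
    "https://example.org/page",
    "http://domain.com.evil.net/",
    "/relative/path",
]

pattern = external_link_pattern("domain.com")
external = [link for link in links if pattern.match(link)]
print(external)
```

Note that this sketch still treats "www.domain.com" as external, because the lookahead only covers the exact host passed in.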
Upvotes: 1
Reputation: 9257
I wouldn't recommend using regex for this task; I'd recommend using urlparse from the urllib.parse module instead.
Here is an example:
>>> from urllib.parse import urlparse
>>> url = urlparse('https://google.com')
>>> url
ParseResult(scheme='https', netloc='google.com', path='', params='', query='', fragment='')
>>> url.scheme
'https'
>>> url.netloc
'google.com'
>>> urlparse('https://www.google.com')
ParseResult(scheme='https', netloc='www.google.com', path='', params='', query='', fragment='')
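Building on that, here is one possible way to filter external links with urlparse. The function name is_external, the choice to treat a leading "www." as the same site, and the sample links are assumptions for illustration; adjust them to your own definition of "external".

```python
from urllib.parse import urlparse

def is_external(link, site_host):
    # Hypothetical helper; `site_host` is the bare host, e.g. "domain.com".
    parsed = urlparse(link)
    # Relative links, mailto:, javascript: and so on are not http(s) links.
    if parsed.scheme not in ("http", "https"):
        return False
    # .hostname drops any port and lowercases the host.
    host = parsed.hostname or ""
    # Assumption: "www.domain.com" counts as the same site as "domain.com".
    if host.startswith("www."):
        host = host[4:]
    return host != site_host.lower()

# Sample links, made up for this example.
links = [
    "https://www.domain.com/a",
    "https://domain.com:8080/x",
    "http://example.org",
    "/about",
    "mailto:someone@example.org",
]
external = [link for link in links if is_external(link, "domain.com")]
print(external)
```

Comparing parsed hosts avoids the substring pitfalls of the string-stripping approach, since the domain name appearing in a path or query string no longer affects the result.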
Upvotes: 2