Reputation: 509
Given a URL such as:
http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html
Is there a way (using some library, package, or vanilla Python) to retrieve the domain "www.feralhouse.com"?
I thought of simply splitting the URL at "www", splitting the item at index 1 at "com", and re-joining the item at index 0 of that result with "www" and "com", like this:
url = "http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html"
url1 = url.split("www")
url2 = url1[1].split("com")
desired_output = "www" + url2[0] + "com"
print(desired_output)
# www.feralhouse.com
But this method breaks in some cases (for example, sites with no "www"; I assume they rely on the browser adding it automatically). I would prefer a less "hacky" approach if possible. Thanks in advance!
NOTE: I don't want a solution just for this SPECIFIC URL; I want a solution for all possible archived URLs.
EDIT: Another example URL:
http://web.archive.org/web/20000614170338/http://www.clonejesus.com/
Upvotes: 0
Views: 269
Reputation: 195573
Two methods: one with str.split, one with the re module:
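import re
s = 'http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html'
# Method 1: split on the first five slashes and keep everything after them.
print(s.split('/', 5)[-1])
# Method 2: capture everything after the 14-digit timestamp.
print(re.findall(r'\d{14}/(.*)', s)[0])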
Prints:
www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html
www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html
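Both methods above return the whole archived URL rather than just the domain. To go one step further, here is a minimal sketch that feeds the captured part to urllib.parse.urlsplit from the standard library. The function name archived_host is hypothetical, and the sketch assumes the archive URL always has the /web/<14-digit timestamp>/ shape seen in the two examples from the question; other Wayback URL shapes are not handled.

```python
import re
from urllib.parse import urlsplit


def archived_host(archive_url):
    """Return the hostname of the page captured in an archive URL, or None.

    Hypothetical helper; assumes the /web/<14 digits>/<original URL> layout.
    """
    # Capture everything after the 14-digit timestamp segment.
    match = re.search(r'/web/\d{14}/(.+)', archive_url)
    if match is None:
        return None
    original = match.group(1)
    # urlsplit only recognises a host when a scheme is present, so add one
    # when the archived URL was stored without it (as in the first example).
    if not original.startswith(('http://', 'https://')):
        original = 'http://' + original
    return urlsplit(original).hostname


print(archived_host('http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html'))
print(archived_host('http://web.archive.org/web/20000614170338/http://www.clonejesus.com/'))
```

This should print www.feralhouse.com and www.clonejesus.com for the two example URLs. It does not depend on the domain starting with "www" or ending in "com", which is where the split-based attempt in the question breaks down.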
Upvotes: 2