Mohamed Moustafa

Reputation: 509

How to retrieve the domain of a web archived website using the archived url in Python?

Given a URL such as:

http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html

Is there a way (using some library, package, or vanilla Python) to retrieve the domain "www.feralhouse.com"?

I thought of splitting the URL at "www", splitting the remainder (index 1) at "com", and rebuilding the domain from the first piece (index 0), like this:

url = "http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html"
url1 = url.split("www")
url2 = url1[1].split("com")
desired_output = "www" + url2[0] + "com"
print(desired_output)
# www.feralhouse.com

But this method breaks in some cases, such as sites archived without "www" (I assume those rely on the browser adding it automatically). I would prefer a less hacky approach if possible. Thanks in advance!

NOTE: I don't want a solution for just this specific URL; I want one that works for all possible archived URLs.

EDIT: Another example URL:

http://web.archive.org/web/20000614170338/http://www.clonejesus.com/

Upvotes: 0

Views: 269

Answers (1)

Andrej Kesely

Reputation: 195573

Two ways to strip the Wayback Machine prefix and recover the original URL, one with str.split and one with the re module:

s = 'http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html'

print(s.split('/', 5)[-1])

import re

print(re.findall(r'\d{14}/(.*)', s)[0])

Prints:

www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html
www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html
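Both lines above return the original URL rather than only the domain. To go one step further, the recovered URL can be passed to urllib.parse.urlparse from the standard library. The following is a minimal sketch, not a definitive implementation: the helper name archived_domain is hypothetical, and it assumes the archived URL has the form /web/<14-digit timestamp>/<original URL>, as in both examples from the question. Since urlparse only fills in the host when a scheme is present, the sketch prepends "http://" when the original URL has none.

```python
import re
from urllib.parse import urlparse


def archived_domain(archived_url):
    """Return the host name of the original site from a Wayback Machine URL.

    Hypothetical helper: assumes a /web/<14-digit timestamp>/ prefix.
    """
    match = re.search(r'/web/\d{14}/(.*)', archived_url)
    if match is None:
        raise ValueError('not a recognised archived URL: %r' % archived_url)
    original = match.group(1)

    # urlparse leaves the host empty unless the URL starts with a scheme,
    # so add one for archived URLs like ".../www.feralhouse.com/...".
    if not re.match(r'^[a-zA-Z][a-zA-Z0-9+.-]*://', original):
        original = 'http://' + original

    return urlparse(original).hostname


urls = [
    'http://web.archive.org/web/20010312011552/www.feralhouse.com/cgi-bin/store/commerce.cgi?page=ac2.html',
    'http://web.archive.org/web/20000614170338/http://www.clonejesus.com/',
]
for u in urls:
    print(archived_domain(u))
```

For the two URLs from the question this prints www.feralhouse.com and www.clonejesus.com. Because it does not depend on the literal strings "www" or "com", it also covers sites archived without a "www" prefix or under other top-level domains.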

Upvotes: 2
