Reputation: 26668
I am using Python and trying to extract a particular part of a URL, as below:
from urlparse import urlparse as ue
url = "https://www.google.co.in"
img_url = ue(url).hostname
Result
www.google.co.in
Case 1:
I will have a number of URLs (stored in a list or somewhere else). For each one I need to find the hostname as above and extract the part after www. and before .co.in, that is, the string between the first dot and the second dot, which gives only google in the present scenario.
So if the URL given is www.gmail.com, I should get only gmail. Whatever URL is given, the code should return the part that starts after the first dot and ends before the second dot.
Case 2:
Some URLs may be given without www, like domain.com or stackoverflow.com. In those cases it should return only domain and stackoverflow.
In short, my intention is to extract the main name from each URL, such as gmail, stackoverflow or google.
With a single URL I could use string slicing to get the name, but I will have a number of URLs, so I need to extract the wanted part dynamically.
Can anyone please let me know how to do this?
Upvotes: 1
Views: 726
Reputation: 40688
Here is my solution. At the end, domains holds the list of names you expect.
import urlparse

urls = [
    'https://www.google.com',
    'http://stackoverflow.com',
    'http://www.google.co.in',
    'http://domain.com',
]

# Host name of each URL, e.g. 'www.google.com'
hostnames = [urlparse.urlparse(url).hostname for url in urls]
# Each host name split on dots, e.g. ['www', 'google', 'com']
hostparts = [hostname.split('.') for hostname in hostnames]
# Skip a leading 'www'; otherwise take the first part
domains = [p[1] if p[0] == 'www' else p[0] for p in hostparts]
print(domains)  # ==> ['google', 'stackoverflow', 'google', 'domain']
First, we extract the host names from the list of URLs using urlparse.urlparse(). The hostnames list looks like this:
[ 'www.google.com', 'stackoverflow.com', ... ]
In the next line, we break each host into parts, using the dot as the separator. Each item in the hostparts looks like this:
[ ['www', 'google', 'com'], ['stackoverflow', 'com'], ... ]
The interesting work is in the next line. This line says, "if the first part before the dot is www, then the domain is the second part (p[1]). Otherwise, the domain is the first part (p[0])." The domains list looks like this:
[ 'google', 'stackoverflow', 'google', 'domain' ]
My code does not know how to handle hosts such as login.gmail.com.hk: it would return login rather than gmail. I hope someone else can solve this problem, as I am late for bed.
Update: Take a look at tldextract by John Kurkowski, which should do what you want.
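For the login.gmail.com.hk case, a minimal standard-library sketch is shown below. It is written for Python 3 (where urlparse lives in urllib.parse) and assumes a small, hand-written set of two-part suffixes; MULTI_PART_SUFFIXES and main_name are hypothetical names for illustration, and the set is nowhere near complete, which is the gap a library such as tldextract is meant to fill.

```python
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

# Illustrative only: a real list of public suffixes is far longer.
MULTI_PART_SUFFIXES = {"co.in", "co.uk", "com.hk"}


def main_name(url):
    """Return the label just left of the suffix, e.g. 'gmail' for login.gmail.com.hk."""
    host = urlparse(url).hostname or ""
    labels = host.lower().split(".")
    # Treat the last two labels as the suffix if they form a known pair,
    # otherwise assume a single-label suffix such as 'com'.
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_PART_SUFFIXES else 1
    if len(labels) <= suffix_len:
        return None  # nothing left of the suffix (e.g. 'localhost')
    return labels[-suffix_len - 1]


urls = [
    "https://www.google.co.in",
    "http://stackoverflow.com",
    "http://login.gmail.com.hk",
    "http://domain.com",
]
names = [main_name(u) for u in urls]
print(names)  # ['google', 'stackoverflow', 'gmail', 'domain']
```

Counting labels from the right, rather than skipping www on the left, means any number of subdomains in front of the name is ignored.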
Upvotes: 0
Reputation: 2200
What about using a set of predefined top-level domains?
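import re
from urlparse import urlparse

# Fake top-level domains, e.g. co.uk, co.in, co.cc.
# Dots are escaped and the pattern is anchored at the end of the host,
# so "." no longer matches any character and only a trailing suffix counts.
TOPLEVEL = [r"\.co\.[a-zA-Z]+$", r"\.fake\.[a-zA-Z]+$"]

def TLD(rgx, host, max_len=4):
    # max_len: longest allowed first label of the suffix ("co", "fake")
    match = re.findall("(%s)" % rgx, host, re.IGNORECASE)
    if match and len(match[0].split(".")[1]) <= max_len:
        return match[0]
    return False

parsed = []
urls = ["http://www.mywebsite.xxx.asd.com", "http://www.dd.test.fake.uk/asd"]
for url in urls:
    o = urlparse(url)
    h = o.hostname
    for j in range(len(TOPLEVEL)):
        TL = TLD(TOPLEVEL[j], h)
        if TL:
            # Strip the matched suffix and keep the label just before it
            name = h.replace(TL, "").split(".")[-1]
            parsed.append(name)
            break
        elif j + 1 == len(TOPLEVEL):
            # No known suffix matched: assume a single-label TLD
            parsed.append(h.split(".")[-2])
            break
print(parsed)  # ==> ['asd', 'test']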
It's a bit hacky, and maybe cryptic for some, but it does the trick, and nothing more has to be done :)
Upvotes: 1
Reputation: 425
Why can't you just do this:
from urlparse import urlparse as ue

urls = ['https://www.google.com', 'http://stackoverflow.com']
parsed = []
for url in urls:
    decoded = ue(url).hostname
    if decoded.startswith('www.'):
        decoded = ".".join(decoded.split('.')[1:])
    parsed.append(decoded.split('.')[0])
# parsed is now your parsed list of names: ['google', 'stackoverflow']
Also, you might want to change the if statement in the for loop, because some hostnames might start with prefixes other than www. that you would also want to get rid of.
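One way to generalise that if statement is sketched below for Python 3. PREFIXES and first_label are hypothetical names, and the prefixes listed are only guesses at what your URLs might contain; adjust them to your data.

```python
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

# Hypothetical prefixes to drop; adjust to whatever your URLs contain.
PREFIXES = ("www", "m", "mobile")


def first_label(url):
    """Return the first host label, skipping one known leading prefix."""
    labels = (urlparse(url).hostname or "").split(".")
    # Only drop the prefix when something besides the TLD remains,
    # so a host like 'm.com' is left alone.
    if len(labels) > 2 and labels[0] in PREFIXES:
        labels = labels[1:]
    return labels[0]


urls = ["https://www.google.com", "http://stackoverflow.com", "http://m.example.org"]
parsed = [first_label(u) for u in urls]
print(parsed)  # ['google', 'stackoverflow', 'example']
```

Like the loop above, this still takes the first remaining label, so it does not cope with deeper subdomains such as login.gmail.com.hk.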
Upvotes: 2