Reputation: 2398
I want to input a URL and extract the domain name, which is the string that comes after http:// or https:// and consists of letters, digits, dots, underscores, or dashes.
I wrote the regex and used Python's re module as follows:
import re
m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
m.group(1)
print(m)
My understanding is that m.group(1) will extract the part between the parentheses in the re.search pattern.
The output that I expect is: google.co.uk
But I am getting this:
<_sre.SRE_Match object; span=(0, 35), match='https://google.co.uk?link=something'>
Can you point out how to use re to achieve this?
Upvotes: 4
Views: 17436
Reputation: 467
The easiest way to do it is with urllib.parse from the standard library:
from urllib.parse import urlsplit
s = "https://google.co.uk?link=something"
urlsplit(s).netloc
The output of this is:
'google.co.uk'
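One caveat: netloc also includes a port or credentials if the URL has them. A minimal sketch, assuming a made-up URL that is not from the question, showing that the hostname attribute of the same result isolates the host:

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only to illustrate the difference
parts = urlsplit("https://user:secret@google.co.uk:8443/search?link=something")

print(parts.netloc)    # user:secret@google.co.uk:8443
print(parts.hostname)  # google.co.uk
print(parts.port)      # 8443
```

For the plain URL in the question, netloc and hostname give the same result.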
Upvotes: 2
Reputation: 17
Jan has already provided a solution for this. Just to note, we can implement the same without using re. All it needs is the punctuation characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ for validation purposes, which are available as string.punctuation in the string module.
import string

def domain_finder(link):
    # Note: this assumes the domain has exactly three dot-separated parts
    dot_splitter = link.split('.')
    # Skip past the scheme ("https://") if it is present
    seperator_first = 0
    if '//' in dot_splitter[0]:
        seperator_first = dot_splitter[0].find('//') + 2
    # Find the first punctuation character after the last domain label
    seperator_end = ''
    for i in dot_splitter[2]:
        if i in string.punctuation:
            seperator_end = i
            break
    if seperator_end:
        end_ = dot_splitter[2].split(seperator_end)[0]
    else:
        end_ = dot_splitter[2]
    domain = [dot_splitter[0][seperator_first:], dot_splitter[1], end_]
    return '.'.join(domain)

link = 'https://google.co.uk?link=something'
domain = domain_finder(link=link)
print(domain)  # prints ==> google.co.uk
This is just another way of solving the same problem without re.
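The function above expects exactly three dot-separated labels. As a rough sketch of the same idea for any number of labels, still without re, one could strip the scheme and cut at the first path, query, or fragment delimiter. The name domain_without_re is made up for this illustration, and a port, if present, is left in place:

```python
def domain_without_re(link):
    # Drop the scheme ("https://") if there is one
    _, sep, rest = link.partition('://')
    if not sep:
        rest = link
    # Keep only what precedes the first '/', '?' or '#'
    for delimiter in '/?#':
        rest = rest.split(delimiter, 1)[0]
    return rest

print(domain_without_re('https://google.co.uk?link=something'))  # google.co.uk
print(domain_without_re('http://a.b.c.example.com/path?x=1'))    # a.b.c.example.com
```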
Upvotes: 1
Reputation: 309
There is a library called tldextract that is very reliable for this.
Here is how it works:
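import tldextract

def extractDomain(url):
    # Only handle strings that look like web addresses
    if "http" in str(url) or "www" in str(url):
        parsed = tldextract.extract(url)
        # Join the non-empty parts by name, so this does not rely on the result being iterable
        parts = (parsed.subdomain, parsed.domain, parsed.suffix)
        return ".".join(part for part in parts if part)
    else:
        return "NA"

# Optional batch mode: read URLs from test.txt and write the domains to out.txt
# with open("test.txt") as ptr, open("out.txt", "w") as op:
#     for lines in ptr.read().split("\n"):
#         op.write(str(extractDomain(lines)) + "\n")
print(extractDomain("https://test.pythonhosted.org/Flask-Mail/"))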
The output is as follows:
test.pythonhosted.org
Upvotes: 0
Reputation: 43169
m.group(1) does return the captured string, but your code prints m, the match object itself. You need to write
print(m.group(1))
Better yet, check that there was a match first, since re.search returns None when nothing matches:
m = re.search('https?://([A-Za-z_0-9.-]+).*', 'https://google.co.uk?link=something')
if m:
    print(m.group(1))
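The check can also be wrapped in a small helper. This is only a sketch: the names DOMAIN_RE and extract_domain are made up here, and the trailing .* from the question's pattern is dropped because it does not affect what group 1 captures.

```python
import re

# Same character class as in the question, compiled once for reuse
DOMAIN_RE = re.compile(r'https?://([A-Za-z_0-9.-]+)')

def extract_domain(url):
    """Return the captured domain, or None when the URL does not match."""
    m = DOMAIN_RE.search(url)
    return m.group(1) if m else None

print(extract_domain('https://google.co.uk?link=something'))  # google.co.uk
print(extract_domain('ftp://example.com'))                    # None
```

Returning None leaves it to the caller to decide what to do with input that is not an http or https URL.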
Upvotes: 10