Reputation: 532
I am trying to extract the domain name from various URLs. Here are the inputs and the results I expect:
1. "www.xakep.ru" should equal "xakep"
2. "http://www.fk3vmxex20vzn4ddp.info/default.html" should equal "fk3vmxex20vzn4ddp"
3. "https://hxin2wz7bkx9oicndd28y6m6i7n.us/img/" should equal "hxin2wz7bkx9oicndd28y6m6i7n"
4. "iccan.org" should equal "iccan"
5. "0iwb0awri.br/warez/" should equal "0iwb0awri"
6. "http://www.google.com/" should equal "google"
My code:
import re
url = "www.xakep.ru"
regex = re.compile(r'(://|www.)+([a-zA-Z-_0-9]+)')
match = regex.search(url)
print(match.group(2))
I am having a problem with strings that have neither http nor www in them.
Upvotes: 1
Views: 116
Reputation: 785128
You may use this regex with 2 optional matches:
^(?:https?://)?(?:www\.)?([^.]+)
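As a quick check, here is a minimal sketch that runs this pattern over the six inputs from the question. The function name extract_domain_regex and the samples list are only illustrative, not part of the original answer.

```python
import re

# Pattern from the answer: optional scheme, optional "www.", then capture
# everything up to the first dot.
DOMAIN_RE = re.compile(r'^(?:https?://)?(?:www\.)?([^.]+)')

def extract_domain_regex(url):
    """Return the first label after an optional scheme and 'www.', or None."""
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None

# Inputs taken from the question.
samples = [
    "www.xakep.ru",
    "http://www.fk3vmxex20vzn4ddp.info/default.html",
    "https://hxin2wz7bkx9oicndd28y6m6i7n.us/img/",
    "iccan.org",
    "0iwb0awri.br/warez/",
    "http://www.google.com/",
]

for url in samples:
    print(url, "->", extract_domain_regex(url))
```

Note that the pattern assumes the input contains at least one dot after the optional prefixes; for a dotless input such as "localhost/path" it would capture the whole remainder.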
RegEx Details:
^ : Start
(?:https?://)? : optionally match http:// or https://
(?:www\.)? : optionally match www.
([^.]+) : Match 1+ of any character that is not a dot in capture group #1
Upvotes: 2
Reputation: 9
I know you asked for a regex, but I normally wouldn't recommend doing this kind of thing "manually", because it is easy to get wrong.
The function you are looking for is in Python's urllib.parse module and should provide everything you need: https://docs.python.org/3/library/urllib.parse.html
Once you have the hostname from the urlsplit function, getting the domain name from it is much easier than trying to parse an arbitrary URL yourself. But then, I might be lazy here.
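A minimal sketch of that approach follows, under a few assumptions. urlsplit only fills in the hostname when the input has a scheme or starts with "//", so the helper prepends "//" to bare inputs such as "iccan.org". The helper name domain_from_url is hypothetical, and taking the first label after stripping "www." is a simplification that matches the question's examples but not hosts with other subdomains.

```python
from urllib.parse import urlsplit

def domain_from_url(url):
    """Return the first hostname label, ignoring a leading 'www.'."""
    # Without a scheme or leading "//", urlsplit treats the host as a path,
    # so add "//" to inputs like "iccan.org" or "www.xakep.ru".
    if "://" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    # Drop a leading "www." if present (assumption: it is never the domain).
    if host.startswith("www."):
        host = host[len("www."):]
    # Simplification: the first remaining label is taken as the domain name.
    return host.split(".")[0]

for url in ["www.xakep.ru", "iccan.org", "http://www.google.com/"]:
    print(url, "->", domain_from_url(url))
```

For hosts such as "mail.example.co.uk" this would return "mail", so anything beyond the question's inputs would need extra handling of subdomains and multi-part suffixes.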
Upvotes: 0