Reputation: 634
I have these URLs:
http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM
http://www.domain.co.uk:affiliate=adwords&ved=0CPsCENEM
http://www.domain.co.uk]affiliate=adwords&ved=0CPsCENEM
http://www.domain.com[affiliate=adwords&ved=0CPsCENEM
How can I get the domain from those URLs, whatever character follows the TLD?
At the moment I am using the regex below, but it only works when the TLD is followed by /:
https?:\/\/(?!.*https?:\/\/)(?:www\.)([\da-z\.-]+)\.([a-z\.]{2,9})
Upvotes: 1
Views: 275
Reputation: 2130
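You can use Python's urlparse module (urllib.parse in Python 3).
# Python 3: urlsplit lives in urllib.parse (the module was named urlparse in Python 2)
from urllib.parse import urlsplit

s = urlsplit('http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM').netloc
ind = 0
parts = s.split('.')
if 'www' in parts:
    # Skip the leading "www" label
    ind = parts.index('www') + 1
print(parts[ind])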
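The same netloc approach can be wrapped in a helper and run over all four sample URLs. This is only a minimal sketch for Python 3; the helper name label_after_www is made up for illustration. One caveat: urlsplit raises ValueError when the network location contains an unbalanced [ or ], which applies to the last two sample URLs, so the sketch returns None for those.

```python
from urllib.parse import urlsplit


def label_after_www(url):
    """Return the host label that follows 'www', or None if urlsplit rejects the URL."""
    try:
        netloc = urlsplit(url).netloc
    except ValueError:
        # urlsplit rejects an unbalanced '[' or ']' in the network location
        return None
    parts = netloc.split('.')
    ind = parts.index('www') + 1 if 'www' in parts else 0
    return parts[ind]


urls = [
    'http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM',
    'http://www.domain.co.uk:affiliate=adwords&ved=0CPsCENEM',
    'http://www.domain.co.uk]affiliate=adwords&ved=0CPsCENEM',
    'http://www.domain.com[affiliate=adwords&ved=0CPsCENEM',
]
for u in urls:
    print(u, '->', label_after_www(u))
```

If the bracket cases matter, a regex that stops at the first character not allowed in a host name avoids the exception.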
Upvotes: 2
Reputation: 425073
This should work:
://.*?(\w+)([^\w.]|$)
Use group 1 of the match.
See demo
Upvotes: 1
Reputation: 158040
In the comments you said you are using Ruby. Assuming the URLs are stored in urls.txt, you can follow this example:
File.open("urls.txt", "r") do |file_handle|
  file_handle.each_line do |url|
    # Capture the run of host-name characters and dots that follows "://"
    url =~ /^[^:]+:\/\/((\.?[a-z0-9-]+)+)/
    domain = $1
    print "#{domain}\n"
  end
end
Explanation:
The regex relies on the fact that any delimiter you can think of must follow at least one rule: it is a character that is not allowed in domain or host names. The allowed characters in domain or host names are [0-9a-z-]. (Note that Unicode characters are allowed as well; I do not handle them in this answer.)
^             Matches the start of the string
[^:]          Character class. Matches any character except `:`
+             The previous match must occur one or more times
:\/\/         The :// after the URL protocol
(             Start of the outer capturing group for the whole domain ($1)
(             Start of the inner capturing group. Matches a subdomain
\.?           An optional literal dot
[a-z0-9-]+    Subdomain, host name or TLD. At least one character
)             End of the inner capturing group
+             Any number of subdomains, but at least one host name, is allowed
)             End of the outer capturing group
The domain name will be available via the first capturing group, $1.
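For readers not using Ruby, the pattern explained above ports almost unchanged to other engines. The following is a rough Python sketch, not part of the original answer: the function name extract_host is hypothetical, the inner group is written as non-capturing, and re.IGNORECASE is an added assumption so that upper-case host names also match.

```python
import re

# Same pattern as in the explanation: scheme, "://", then one or more
# optionally dot-prefixed labels made of [a-z0-9-].
# Assumption: IGNORECASE is added here; the inner group is non-capturing.
HOST_RE = re.compile(r'^[^:]+://((?:\.?[a-z0-9-]+)+)', re.IGNORECASE)


def extract_host(url):
    """Return the host part of url, or None if the pattern does not match."""
    m = HOST_RE.match(url)
    return m.group(1) if m else None


urls = [
    'http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM',
    'http://www.domain.co.uk:affiliate=adwords&ved=0CPsCENEM',
    'http://www.domain.co.uk]affiliate=adwords&ved=0CPsCENEM',
    'http://www.domain.com[affiliate=adwords&ved=0CPsCENEM',
]
for u in urls:
    print(extract_host(u))
```

The match stops at the first character outside [a-z0-9-] and the dot, so the kind of delimiter that follows the TLD does not matter.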
First Answer
It depends on the regex engine.
The following regex can be used with Perl-compatible regular expressions (PCRE):
grep -ioP '^[^:]+://\K(\.?[a-z0-9-]+)+'
With extended POSIX regexes and awk, you might use:
awk -F'(://|[^0-9a-zA-Z.-])' '{print $2}'
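The field-splitting idea behind the awk command can be sketched in Python with re.split. This is an illustration under assumptions rather than a drop-in equivalent: the name second_field is hypothetical, and the separator class deliberately includes the hyphen so that hyphenated host names stay in one field.

```python
import re

# Field separator: either "://" or any single character that cannot
# appear in a host name (assumption: '-' is treated as a host character).
SEP_RE = re.compile(r'://|[^0-9a-zA-Z.-]')


def second_field(line):
    """Return the second field of line, mirroring awk's $2, or None if absent."""
    fields = SEP_RE.split(line)
    return fields[1] if len(fields) > 1 else None


for u in (
    'http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM',
    'http://www.domain.com[affiliate=adwords&ved=0CPsCENEM',
):
    print(second_field(u))
```

The first field is the scheme and the second is the host, because "://" is tried before the single-character alternative.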
...
Upvotes: 1