John

Reputation: 634

Get the domain from any URL

I have these URLs:

http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM
http://www.domain.co.uk:affiliate=adwords&ved=0CPsCENEM
http://www.domain.co.uk]affiliate=adwords&ved=0CPsCENEM
http://www.domain.com[affiliate=adwords&ved=0CPsCENEM

How can I get the domain from these URLs, whatever character follows the TLD?

At the moment I am using the regex below, but it only works if the TLD is followed by /:

https?:\/\/(?!.*https?:\/\/)(?:www\.)([\da-z\.-]+)\.([a-z\.]{2,9})

Upvotes: 1

Views: 275

Answers (3)

Saksham Varma

Reputation: 2130

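You can use Python's urllib.parse (the urlparse module in Python 2).

from urllib.parse import urlsplit  # Python 2: import urlparse

# netloc is everything between '://' and the first '/', '?' or '#'.
# Note: urlsplit raises ValueError if the netloc has an unmatched '[' or ']'.
s = urlsplit('http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM').netloc
ind = 0
parts = s.split('.')
# Skip a 'www' label if there is one
if 'www' in parts:
    ind = parts.index('www') + 1
print(parts[ind])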

Upvotes: 2

Bohemian

Reputation: 425073

This should work:

://.*?(\w+)([^\w.]|$)

Use group 1 of the match.
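
As a quick check of what group 1 holds, here is a minimal Python sketch (Python is an assumption; the answer does not name a regex engine). It runs the pattern above on the question's sample URLs. The names `FULL_HOST` and `first_group` are hypothetical, and `FULL_HOST` is a variant that is not part of this answer.

```python
import re

# Pattern from the answer above, unchanged.
LAST_LABEL = re.compile(r'://.*?(\w+)([^\w.]|$)')
# Hypothetical variant (not from the answer): capture the whole host name.
FULL_HOST = re.compile(r'://([\w.-]+)')

urls = [
    'http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM',
    'http://www.domain.co.uk:affiliate=adwords&ved=0CPsCENEM',
    'http://www.domain.co.uk]affiliate=adwords&ved=0CPsCENEM',
    'http://www.domain.com[affiliate=adwords&ved=0CPsCENEM',
]

def first_group(pattern, url):
    """Return group 1 of the first match, or None if nothing matches."""
    match = pattern.search(url)
    return match.group(1) if match else None

for url in urls:
    print(first_group(LAST_LABEL, url), first_group(FULL_HOST, url))
```

With Python's `re`, the answer's pattern yields the last label of the host (for example `uk`), while the variant yields the whole host name, so pick whichever matches what you mean by "domain".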

Upvotes: 1

hek2mgl

Reputation: 158040

In the comments you said you are using Ruby. Assuming the URLs are stored in urls.txt, you can follow this example:

File.open("urls.txt", "r") do |file_handle|
    file_handle.each_line do |url|
        # Capture everything after "://" that may appear in a host name
        url =~ /^[^:]+:\/\/((\.?[a-z0-9-]+)+)/
        domain = $1
        print "#{domain}\n"
    end
end

Explanation:

The regex relies on the fact that any delimiter you can think of must follow at least one rule: it is a character that is not allowed in domain or host names. The allowed characters in domain or host names are [0-9a-z-], with labels separated by dots. (Unicode characters are allowed as well, but this answer does not handle them.)

^              Matches the start of the string
[^:]           Character class. Matches any character except `:`
+              The previous match must occur one or more times
:\/\/          The :// after the URL protocol
(              Start of the outer matching group for the whole domain ($1)
(              Start of the inner matching group. Matches one subdomain
\.?            An optional literal dot
[a-z0-9-]+     Subdomain, host name or TLD. At least one character
)              End of the inner matching group
+              The inner group may repeat: any number of subdomains, but at least one host name
)              End of the outer matching group

The domain name will be available via the first capturing group $1.
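
For readers not using Ruby, the pattern explained above can be sketched in Python as follows. This is a minimal port under stated assumptions, not part of the original answer: `HOST_RE` and `extract_host` are hypothetical names, and the case-insensitive flag and the optional stripping of a leading `www.` are additions.

```python
import re

# Port of the Ruby pattern explained above. The inner group is made
# non-capturing so that group 1 is the whole host name.
# re.IGNORECASE is an added assumption (host names are case-insensitive).
HOST_RE = re.compile(r'^[^:]+://((?:\.?[a-z0-9-]+)+)', re.IGNORECASE)

def extract_host(url, strip_www=True):
    """Return the host part of url, or None if the pattern does not match."""
    match = HOST_RE.match(url)
    if match is None:
        return None
    host = match.group(1).lower()
    # Optional step (not in the Ruby code): drop a leading "www." label.
    if strip_www and host.startswith('www.'):
        host = host[4:]
    return host

if __name__ == '__main__':
    samples = [
        'http://www.domain.co.uk&affiliate=adwords&ved=0CPsCENEM',
        'http://www.domain.com[affiliate=adwords&ved=0CPsCENEM',
    ]
    for sample in samples:
        print(extract_host(sample))
```

Matching stops at the first character that cannot appear in a host name, which is why the delimiter after the TLD does not matter.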


First Answer

It depends on the regex engine.

The following regex can be used with Perl-compatible regular expressions (PCRE):

grep -ioP '^[^:]+://\K(\.?[a-z0-9-]+)+'

With POSIX extended regular expressions and awk, you might use:

awk -F'(://|[^0-9a-zA-Z.-])' '{print $2}'

...

Upvotes: 1
