Learning

Reputation: 1191

Parse Out Root Domain Using RegEx & Pre-Determined List of TLDs

I'd like to use a regex to parse out the root domain of a given input URL. I already know that practically any regex can be "broken" by the right input URL, which is why I'd like to restrict the regex to a given list of TLDs (if that's possible). Here is an example:

Let's say I have an input file and will run each URL in it through the regex, one at a time. Here is the input file:

www.google.co.uk
www.google.co.uk/something
www.google.com/
www.google.com/something
google.com/
google.com/something
subdomain.google.com/
subdomain.google.com/something
www.subdomain.google.com/
www.google.net/
www.google.net/something
google.net/

The final result should be this:

google.co.uk
google.co.uk
google.com
google.com
google.com
google.com
google.com
google.com
google.com
google.com
google.com
google.com

The important thing, though, is that the regex should parse based on the following:

Find the TLD in the given URL from a list of given TLDs, for instance:

(co.uk|com|net|edu|gov|etc|etc|etc)

IF one of the given TLDs is found THEN match and parse out everything to the left of (and including) that TLD, UP UNTIL it reaches either the beginning of the line OR another "."

If it's possible to write a regex that matches based on that pseudo-code description, it should parse the sample input data exactly as shown.
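For reference, the pseudo-code description above could be sketched roughly as follows in Python. This is only a sketch under assumptions: the names TLDS, PATTERN and root_domain are made up for illustration, the TLD list is just the sample list, and inputs are assumed to carry no scheme or port, as in the sample file.

```python
import re

# Illustrative TLD list (an assumption); longest entries first so that
# multi-part suffixes such as "co.uk" are tried before shorter ones.
TLDS = ["co.uk", "com", "net", "edu", "gov"]
ALTERNATION = "|".join(re.escape(t) for t in sorted(TLDS, key=len, reverse=True))

# Start of host or a ".", then one label with no dots, a ".", a listed TLD,
# and the end of the host.
PATTERN = re.compile(r"(?:^|\.)([^.]+\.(?:%s))$" % ALTERNATION)


def root_domain(url):
    """Return the root domain of a scheme-less URL, or None if no listed TLD fits."""
    host = url.strip().split("/", 1)[0]  # drop any path before matching
    match = PATTERN.search(host)
    return match.group(1) if match else None


if __name__ == "__main__":
    for line in ["www.google.co.uk/something", "subdomain.google.com/", "google.net/"]:
        print(root_domain(line))
```

Cutting the path off before matching keeps a TLD-like string inside the path (for example a file name ending in .com) from being mistaken for the host.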

Upvotes: 0

Views: 1199

Answers (3)

user2466962

Reputation: 1

Actually, there is no reliable way to parse a URI using a regex, for lots of reasons. For example, localhost, 192.168.0.43 and www.google.co.uk are all valid.

Moreover, if you extract the last element after the final '.', you don't want '43' from your IP address as a TLD, and there are many exceptions (co.uk and bl.uk have two different behaviors).
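To illustrate the IP address and localhost cases, here is a minimal Python sketch of a pre-check that could run before any TLD matching. The function name has_registrable_name is hypothetical; only the standard-library ipaddress module is used, and this is not how faup works internally.

```python
import ipaddress


def has_registrable_name(host):
    """Return False for IP literals and dot-less hosts such as localhost."""
    try:
        ipaddress.ip_address(host)
        return False  # an IPv4/IPv6 literal has no TLD to extract
    except ValueError:
        pass  # not an IP literal, so it may be a domain name
    return "." in host  # a host without any dot has no TLD either


if __name__ == "__main__":
    for host in ["localhost", "192.168.0.43", "www.google.co.uk"]:
        print(host, has_registrable_name(host))
```

Exceptions such as co.uk versus bl.uk are not covered by this check; they need a suffix list rather than a pattern.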

I wrote a C library with Python bindings and a command-line tool, available here: http://www.github.com/stricaud/faup, so you can do things like:

$ faup -p www.example.com
scheme,credential,subdomain,domain,host,tld,port,resource_path,query_string,fragment
,,www,example.com,www.example.com,com,,,,

To get the domain, you can put all the URLs in a file and run it through faup:

$ cat urls.txt |faup -f domain
google.co.uk
google.co.uk
google.com
google.com
google.com
google.com
google.com
google.com
google.com
google.net
google.net
google.net

If you just want the TLD, you can use the -f tld parameter, such as:

$ faup -f tld www.example.com
com

Or even get JSON output:

$ faup -o json http://www.test.co.uk/index.html?foo=bar#tagada
{
    "scheme": "http",
    "credential": "",
    "subdomain": "www",
    "domain": "test.co.uk",
    "host": "www.test.co.uk",
    "tld": "co.uk",
    "port": "",
    "resource_path": "/index.html",
    "query_string": "?foo=bar",
    "fragment": "#tagada"
}

Not only is this faster than a regex, but it also deals with all the special cases you run into whenever you want to do something as seemingly simple as the domain/TLD extraction you want here.

Upvotes: 0

LaGrandMere

Reputation: 10359

In Java:

package test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    public static void main(String[] args) {
        // Sample input from the question, one URL per line.
        String subject = "www.google.co.uk\nwww.google.co.uk/something\nwww.google.com/\nwww.google.com/something\ngoogle.com/\ngoogle.com/something\nsubdomain.google.com/\nsubdomain.google.com/something\nwww.subdomain.google.com/\nwww.google.net/\nwww.google.net/something\ngoogle.net/\n";
        // One label, a dot, then one of the listed TLDs. The dot in co.uk is
        // escaped so that it only matches a literal '.'.
        Pattern pattern = Pattern.compile("(\\w+)\\.(co\\.uk|com|net|edu|gov)");

        Matcher m = pattern.matcher(subject);
        // Print every root domain found in the input.
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}

Regex = (\w+)\.(co\.uk|com|net|edu|gov)
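As a quick check of why the dot inside co.uk is worth escaping in a pattern like this, here is a small Python comparison. The variable names loose and strict and the sample string are made up for illustration; the two patterns differ only in whether that dot is escaped.

```python
import re

# Unescaped dot: "co.uk" matches "co", ANY character, then "uk".
loose = re.compile(r"(\w+)\.(co.uk|com|net|edu|gov)")
# Escaped dot: only a literal "co.uk" is accepted.
strict = re.compile(r"(\w+)\.(co\.uk|com|net|edu|gov)")

sample = "shop.coxuk"  # not a real co.uk host

if __name__ == "__main__":
    print(bool(loose.search(sample)))   # the unescaped pattern accepts it
    print(bool(strict.search(sample)))  # the escaped pattern rejects it
```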

Upvotes: 1

Mutant Bob

Reputation: 3549

perl -ne 'print $2, "\n" if m-^([^/]+?\.|)([^./]*\.(co\.uk|com|net|edu|gov|etc|etc|etc))(/.*|)$-'  /tmp/x.txt

This seems to give the results you are looking for, at least on the sample data you provided (assuming you don't want to translate google.net to google.com).

Note that I did get a little lazy with my [^./], which could match characters which are not legal in domain names. Then again, i18n has probably rewritten the rules for DNS to include a lot more characters than when I was young.
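If the lazy character class is a concern, one possible tightening is sketched below in Python rather than Perl. It assumes the classic letters/digits/hyphen rule for labels and converts internationalized names to their ASCII form with Python's built-in idna codec first; the names LABEL, PATTERN and strict_root_domain are hypothetical, and the TLD list is only the sample one.

```python
import re

# A label made of letters, digits and hyphens, not starting or ending with "-".
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# Optional subdomains, then one strict label plus a listed TLD at the end.
PATTERN = re.compile(r"^(?:.+\.)?(%s\.(?:co\.uk|com|net|edu|gov))$" % LABEL)


def strict_root_domain(host):
    """Return the root domain in ASCII form, or None if the host does not fit."""
    try:
        # Internationalized labels become their "xn--" ASCII equivalents.
        ascii_host = host.encode("idna").decode("ascii")
    except UnicodeError:
        return None  # empty or over-long labels, or characters IDNA rejects
    match = PATTERN.match(ascii_host)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(strict_root_domain("www.google.co.uk"))
    print(strict_root_domain("foo_bar.com"))
```

With this variant an underscore in the root label makes the match fail, while a non-ASCII label is accepted once it has been converted.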

Upvotes: 2
