Peter

Reputation: 131

Bash based regex domain name validation

I want to create a script that will add new domains to our DNS servers. I found this fully qualified domain name validation regex. However, when I use it with sed, it does not work as I expect:

echo test | sed  '/(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(:[a-zA-Z]{2,})$)/p'  
--------
Output is: 
test
echo test.com | sed  '/(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(:[a-zA-Z]{2,})$)/p'  
--------
Output is: 
test.com

I expected the output of the first command to be empty. What am I doing wrong?

Upvotes: 2

Views: 14106

Answers (6)

Doktor J

Reputation: 1118

I find this to be a more comprehensive regex:

(?=^.{4,253}$)(^(?:[a-zA-Z0-9](?:(?:[a-zA-Z0-9\-]){0,61}[a-zA-Z0-9])?\.)+([a-zA-Z]{2,}|xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])$)

  • RFC 1034§3: Allows for a length of 4-253, so the shortest operational domain I'm aware of, "t.co", still matches where the other answers reject it. The maximum length is 255 bytes; subtracting the length octet for each label (TLD and "primary" subdomain) gives 253: (?=^.{4,253}$)
    • RFC 3696§2: Single-letter TLDs are technically permitted, meaning the minimum length would be 3, but as there are currently no single-letter TLDs a minimum length of 4 is practical.
  • RFC 1034§3: Allows numbers in subdomains, which Conor Clafferty's answer apparently doesn't (it does not distinguish other subdomains from "primary" subdomains, i.e. the domain you register, and neither does the DNS spec)
  • RFC 1034§3: Restricts individual labels to 63 characters, permitting hyphens in the middle while restricting the beginning and end to alphanumerics: (?:[a-zA-Z0-9](?:(?:[a-zA-Z0-9\-]){0,61}[a-zA-Z0-9])?\.)
  • Requires a two-letter or larger TLD, but may be punycoded ([a-zA-Z]{2,}|xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])
    • RFC 3696§2: The DNS spec technically permits numerics in the TLD, as well as single-letter TLDs; however, there are currently no single-letter TLDs or TLDs with numbers, and all-numeric TLDs are not permitted, so this part of the regex has been simplified to [a-zA-Z]{2,}.

      --OR--

    • RFC 3490§5: an internationalized domain name ccTLD (IDN ccTLD) may be punycoded, as indicated by an "xn--" prefix, after which it may contain letters, numbers, or hyphens. This approximates to xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9]

      Be aware that this pattern does not validate a punycode TLD! Invalid punycode will be tolerated, e.g. "xn--qqqq", because attempting to validate punycode against the appropriate encoding mechanisms is beyond the scope of a regular expression. While punycode itself technically permits an encoded string ending in a hyphen, RFC 3492§5 observes and respects the IDNA limitation that labels may not end in a hyphen.
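As a rough sketch of how the rules in the list above could be checked from a bash script without PCRE support, the function below tests them one at a time with bash built-ins. The name is_fqdn and the sample inputs are made up for illustration, and this is a rule-by-rule translation under my own assumptions, not the regex itself. Like the regex, it does not validate punycode.

```shell
#!/usr/bin/env bash

# Hypothetical helper: returns 0 if $1 passes the rules listed above.
is_fqdn() {
    local name=$1 i last
    local -a labels
    local chars_re='^[a-zA-Z0-9.-]+$'
    local label_re='^[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?$'
    local tld_re='^([a-zA-Z]{2,}|xn--[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])$'

    # Overall length must be 4 to 253 characters.
    (( ${#name} >= 4 && ${#name} <= 253 )) || return 1

    # Only letters, digits, hyphens and dots; no empty labels.
    [[ $name =~ $chars_re ]] || return 1
    if [[ $name == .* || $name == *. || $name == *..* ]]; then
        return 1
    fi

    # Split into labels; at least one dot (two labels) is required.
    IFS=. read -r -a labels <<< "$name"
    last=$(( ${#labels[@]} - 1 ))
    (( last >= 1 )) || return 1

    # Every label before the TLD: 1-63 chars, alphanumeric at both ends.
    for (( i = 0; i < last; i++ )); do
        [[ ${labels[i]} =~ $label_re ]] || return 1
    done

    # TLD: two or more letters, or an xn-- (punycode) label.
    [[ ${labels[last]} =~ $tld_re ]]
}

# Sample inputs, chosen only for illustration.
for d in t.co test 112.com -bad.com; do
    if is_fqdn "$d"; then echo "$d: ok"; else echo "$d: rejected"; fi
done
```

Because the function reports its result through the exit status, it can be used directly in an if before the name is handed to whatever adds the DNS record.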

EDIT 02/2021: Hat tip to user2241415 for pointing out that IDN ccTLDs did not match the previously-specified regex.

Upvotes: 14

Dirk Hoffmann

Reputation: 1563

If the domain has to exist already, you can try a DNS lookup instead:

$ cat test.sh
#!/bin/bash

for h in "bert" "ernie" "www.google.com"
do
    host "$h" > /dev/null 2>&1   # discard stdout and stderr; only the exit status matters
    if [ $? -eq 0 ]
    then
        echo "$h is a FQDN"
    else
        echo "$h is not a FQDN"
    fi
done

jalderman@mba:/tmp$ ./test.sh 
bert is not a FQDN
ernie is not a FQDN
www.google.com is a FQDN

Upvotes: -2

Bob van Luijt

Reputation: 7588

I use grep -P to do this.

echo test | grep -P "^[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,})+$" 
--------
Output is: 

echo www.test.com | grep -P "^[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,})+$"
--------
Output is: www.test.com

Upvotes: 0

tripleee

Reputation: 189307

No sed implementation I am aware of supports the Perl extensions you are using in that regex: the lookaheads (?=...) and (?!...), the non-capturing groups (?:...), and \d. Try with Perl or grep -P or pcregrep, or simplify the regex to something sed can cope with. Here is a quick and dirty adaptation which splits the regex into a script of three different regexes, and rejects the input when one of them fails to match (or, in the case of the middle one, when it matches).

echo 'test' | sed -r '/^.{5,254}$/!d
    /^([^.]*\.)*[0-9]+\./d   # Seems incorrect; 112.com is valid
    /^([a-zA-Z0-9_\-]{1,63}\.?)+([a-zA-Z]{2,})$/!d'  # should disallow underscore
    # also, what's with the question mark after the literal dot?

This also completely fails to accept IDNA domains (which can contain dashes and numbers in the TLD, among other things) so I would definitely not recommend this, but hopefully it shows you how to adapt something like this to sed if you wish to.
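For illustration, the same splitting technique could be applied to a stricter label pattern that addresses the comments in the script above: no underscores, all-numeric labels such as 112.com allowed, and a mandatory dot after every label. This is only a sketch. The function name filter_fqdn is hypothetical, the 4 to 253 length range and the label pattern are borrowed from Doktor J's answer in this thread, and IDN TLDs are still not accepted.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints $1 if it passes both checks, prints nothing otherwise.
filter_fqdn() {
    # First expression replaces the length lookahead; second checks labels and TLD.
    # Each expression deletes the line when it does NOT match (!d).
    printf '%s\n' "$1" | sed -E \
        -e '/^.{4,253}$/!d' \
        -e '/^([a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,}$/!d'
}

filter_fqdn test        # no dot, so nothing is printed
filter_fqdn 112.com     # all-numeric label is accepted
```

Since sed prints every line that survives the script, an empty result means the name was rejected.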

Upvotes: 1

Conor Clafferty

Reputation: 171

Pierre-Louis' answer didn't quite work for me: for example, "kittens" is considered a domain name. I made one slight adjustment to ensure that the domain has at least one dot in it.

(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+\.(?:[a-z]{2,})$)

There's an extra \. just before the last portion of the domain.

Upvotes: 0

Pilou

Reputation: 1478

You are missing a question mark in your regex: the final group should start with (?: rather than (:

(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)

You can test your regex here

You can do what you want with grep -P:

$ echo test.com | grep -P '(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)'
test.com
$ echo test | grep -P '(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)'
$

Upvotes: 3
