gaurav pandey

Reputation: 65

Regular expression for ipv6 address

I have an itcl file in which my regular expression matches IPv4 addresses correctly, but the same expression does not work for IPv6 addresses.

My expression is:

REGEXP [^:]+://[^:/]+(:[0-9]+)?/?

which matches correctly for something like:

https://10.77.56.89

but I want it to work the same way for something like:

https://[2001:1:1:43::115]/ucmuser, which is reported as having an incorrect format.

Upvotes: 0

Views: 1538

Answers (2)

Peter Lewerin

Reputation: 13252

A more relaxed variant:

% package require ip
1.3
% set addr1 https://10.77.56.89
https://10.77.56.89
% set addr2 {https://[2001:1:1:43::115]/ucmuser}
https://[2001:1:1:43::115]/ucmuser

Just get the ip numbers from the addresses in the simplest possible way*:

% set ip1 [regexp -inline {\d.*\d} $addr1]
10.77.56.89
% set ip2 [regexp -inline {\d.*\d} $addr2]
2001:1:1:43::115

And then validate them:

% ::ip::version $ip1
4
% ::ip::version $ip2
6

*) This method is for illustrative purposes only and will certainly not work for all URLs. The principle is to start with a very simple extraction method and, if valid IP addresses are extracted badly and rejected, refine the method stepwise until it is just as complex as it needs to be, and no more.

E.g. if we get a URL like this:

set addr3 http://127.0.0.1/a/b/c/1

the \d.*\d pattern will match all the way to the last digit in the URL, path included. However, it's easy to solve this by refining the pattern slightly so that it cannot cross a slash:

% set ip3 [regexp -inline {\d[^/]*\d} $addr3]
127.0.0.1

and so on.
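To make the difference between the two patterns concrete, here is a small sketch that runs both against the same URL. The variable names `loose` and `tight` are only illustrative and are not part of the answer above.

```tcl
set addr3 http://127.0.0.1/a/b/c/1

# Greedy .* runs from the first digit to the last digit in the whole URL.
set loose [regexp -inline {\d.*\d} $addr3]

# [^/]* cannot cross a slash, so the match ends with the address itself.
set tight [regexp -inline {\d[^/]*\d} $addr3]

puts $loose   ;# 127.0.0.1/a/b/c/1
puts $tight   ;# 127.0.0.1
```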

It doesn't have to be a regexp operation either:

set ipX [string trim [lindex [split $addrX /] 2] {[]}]

works for all the URLs mentioned here.
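As a minimal sketch, the split-based approach can be wrapped in a proc. The name `hostFromUrl` is hypothetical, and the sketch assumes the URL has the form scheme://host/path with no port number; a port suffix such as :8080 would not be removed.

```tcl
# Hypothetical helper: return the host part of a URL without using regexp.
proc hostFromUrl {url} {
    # Splitting "scheme://host/path" on "/" gives: "scheme:", "", "host", ...
    # so the host (the URL's authority part) is the element at index 2.
    set authority [lindex [split $url /] 2]
    # Strip the square brackets that surround an IPv6 literal, if any.
    return [string trim $authority {[]}]
}

puts [hostFromUrl https://10.77.56.89]                   ;# 10.77.56.89
puts [hostFromUrl {https://[2001:1:1:43::115]/ucmuser}]  ;# 2001:1:1:43::115
puts [hostFromUrl http://127.0.0.1/a/b/c/1]              ;# 127.0.0.1
```

The braces around the IPv6 URL keep Tcl from treating the square brackets as command substitution.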

Documentation: ip (package), lindex, package, set, split, string, regexp

Upvotes: 0

Donal Fellows

Reputation: 137567

The problem is that your regular expression isn't accounting for IPv6 numeric addresses (not that I'd recommend their use in the first place; it's wise to use DNS to bind them to a name in production use).

To examine how things are failing, let's adapt the RE slightly to capture a bit more:

([^:]+)://([^:/]+)(:[0-9]+)?(/?)

In this version, everything that isn't utterly fixed is captured. Now let's test it against your use cases with regexp -inline. The -inline option makes regexp return the matched substrings instead of storing them in variables, which is great for debugging REs. It also really helps to put the RE in a variable and use it as below, since that makes it easier to avoid typos:

% set RE {([^:]+)://([^:/]+)(:[0-9]+)?(/?)}
([^:]+)://([^:/]+)(:[0-9]+)?(/?)
% regexp -inline $RE {https://10.77.56.89}
https://10.77.56.89 https 10.77.56.89 {} {}
% regexp -inline $RE {https://[2001:1:1:43::115]/ucmuser}
{https://[2001:1} https {[2001} :1 {}

We see that the [^:/]+ part (the hostname) is the problem, as it is stopping at the first colon in the IPv6 address; the following :1 is then mistaken for a port number. We need to add a special case for when the hostname begins with [. We won't do full validation (check the ip package in Tcllib if you want that), but we can do some simple checking that the contents of the brackets are hex digits or colons.

% set RE {([^:]+)://([^]:[/]+|\[[0-9a-f:A-F]+\])(:[0-9]+)?(/?)}
([^:]+)://([^]:[/]+|\[[0-9a-f:A-F]+\])(:[0-9]+)?(/?)
% regexp -inline $RE {https://10.77.56.89}
https://10.77.56.89 https 10.77.56.89 {} {}
% regexp -inline $RE {https://[2001:1:1:43::115]/ucmuser}
{https://[2001:1:1:43::115]/} https {[2001:1:1:43::115]} {} /

That looks right to me (yes, it took a little tinkering to get the syntax right because of the interactions with the syntax for POSIX RE character classes). Converting to have the same capturing groups that you originally had, your RE should be this:

[^:]+://(?:[^]:[/]+|\[[0-9a-f:A-F]+\])(:[0-9]+)?/?

(NB: We're using a non-capturing parenthesis, (?:...), in this because we need alternation, |, between two sub-REs without adding a new capturing group.)
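For illustration, the bracket-aware RE can be put to work in a small proc that pulls out the scheme, host, and port. This is only a sketch: the name `splitUrl` is hypothetical, and the port group has been changed here to capture just the digits, which differs from the capturing groups in the original expression.

```tcl
# Hypothetical helper built on the bracket-aware RE above.
proc splitUrl {url} {
    # Groups: scheme, host (name, IPv4, or bracketed IPv6), optional port digits.
    set RE {([^:]+)://([^]:[/]+|\[[0-9a-f:A-F]+\])(?::([0-9]+))?}
    if {![regexp -- $RE $url -> scheme host port]} {
        return -code error "unrecognised URL format: $url"
    }
    # Drop the brackets around an IPv6 literal; port is empty if absent.
    return [list $scheme [string trim $host {[]}] $port]
}

lassign [splitUrl {https://[2001:1:1:43::115]/ucmuser}] scheme host port
puts "scheme=$scheme host=$host port=$port"

lassign [splitUrl {http://[::1]:8080/x}] scheme host port
puts "scheme=$scheme host=$host port=$port"
```

Like the RE it is based on, this only checks that the bracketed part consists of hex digits and colons; it does not confirm that the address is a valid IPv6 address.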

Upvotes: 1
