Michael

Reputation: 22957

Regex for getting domain of a URL Excel VB

I have an Excel file with URLs of the form http://test.example.com/anything...

I want to reduce each one to http://test.example.com

Does anyone know the regex I should use? (I already have a VB macro for the replacement; I just need the regex.)

Thanks

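' Requires a reference to "Microsoft VBScript Regular Expressions 5.5"
' (Tools > References in the VBA editor).
Public Function SearchNReplace1(Pattern1 As String, _
   Pattern2 As String, Replacestring As String, _
   TestString As String) As String
   Dim reg As New RegExp

   reg.IgnoreCase = True
   reg.MultiLine = False
   ' Replace only when the string matches Pattern1.
   reg.Pattern = Pattern1
   If reg.Test(TestString) Then
     ' Replace the first match of Pattern2 with Replacestring.
     reg.Pattern = Pattern2
     SearchNReplace1 = reg.Replace(TestString, Replacestring)
   Else
     SearchNReplace1 = TestString
   End If
End Function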

Upvotes: 1

Views: 2200

Answers (2)

vbence

Reputation: 20333

from: ([a-z]+://[a-z0-9.-]+)[^ ]* to: \1

This will eat everything after the domain name until it encounters a space or the end of the string. Please give more details if this one does not suit you.

If you need IPv6 addresses as hosts, you have to allow the [, ] and : characters too:

from: ([a-z]+://[a-z0-9.\[\]:-]+)[^ ]* to: \1
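
As a rough sketch of how the first pattern above could be used from VBA: VBScript's RegExp object writes backreferences in the replacement string as $1 rather than \1, so the "to" part changes accordingly. The function name StripToDomain is made up for illustration, and late binding through CreateObject is assumed so that no library reference is needed.

```vb
' Hypothetical helper (not from the question): trims a URL to scheme + host.
Public Function StripToDomain(ByVal Url As String) As String
    Dim reg As Object
    ' Late binding: no reference to the VBScript regex library is required.
    Set reg = CreateObject("VBScript.RegExp")

    reg.IgnoreCase = True   ' lets [a-z] match upper-case letters too
    reg.Global = False      ' replace the first match only
    reg.Pattern = "([a-z]+://[a-z0-9.-]+)[^ ]*"

    ' VBScript RegExp uses $1, not \1, in the replacement string.
    StripToDomain = reg.Replace(Url, "$1")
End Function
```

Placed in a standard module, this can be called from a worksheet cell as =StripToDomain(A1). With A1 holding http://test.example.com/anything... it should return http://test.example.com. The same pattern and "$1" could also be passed to the question's SearchNReplace1 macro as Pattern2 and Replacestring.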

Upvotes: 3

ridgerunner

Reputation: 34395

RFC 3986, Appendix B, gives us this regex for decomposing a generic URI:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

Since you are interested in plucking out everything up to the path, here is an equivalent regex which should work quite nicely (in PHP syntax to allow comments):

$re = '%# Match URI and capture scheme and authority in $1.
^                  # Anchor to beginning of string.
(                  # $1: Everything up to path.
  (?: [^:/?#]+:)?  # Optional scheme.
  (?://[^/?#]* )?  # Optional authority.
)                  # End $1: Everything up to path.
        [^?#]*     # Required path.
(?:\?    [^#]* )?  # Optional query.
(?:\#       .* )?  # Optional fragment.
$                  # Anchor to end of string.
%x';

And here is the exact same regex, in short form, that should work in VB:

myRegExp.Pattern = "^((?:[^:/?#]+:)?(?://[^/?#]*)?)[^?#]*(?:\?[^#]*)?(?:#.*)?$"

This regex does not validate the URI; it just decomposes it into its various components and plucks out the part you need into capture group 1. Note that every component but the path is optional (and the path itself may be empty). In other words, an empty string is a valid URI!
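
A minimal sketch of reading capture group 1 from VBA, assuming the pattern above and late binding through CreateObject. The function name UriUpToPath is hypothetical and not part of the answer.

```vb
' Hypothetical helper: returns scheme + authority (capture group 1) of a URI.
Public Function UriUpToPath(ByVal Uri As String) As String
    Dim myRegExp As Object
    Dim matches As Object

    ' Late binding: no reference to the VBScript regex library is required.
    Set myRegExp = CreateObject("VBScript.RegExp")
    myRegExp.Pattern = "^((?:[^:/?#]+:)?(?://[^/?#]*)?)[^?#]*(?:\?[^#]*)?(?:#.*)?$"

    Set matches = myRegExp.Execute(Uri)
    If matches.Count > 0 Then
        ' SubMatches is zero-based, so index 0 is capture group 1.
        UriUpToPath = matches(0).SubMatches(0)
    Else
        ' No match: hand the input back unchanged.
        UriUpToPath = Uri
    End If
End Function
```

For an input such as http://test.example.com/anything?x=1#top, this should return http://test.example.com. Because the scheme and authority are both optional, a string with neither (a bare path, for instance) yields an empty result rather than an error.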

Upvotes: 0
