techguy1988

Reputation: 13

Regex Classic ASP

I've currently got a string which contains a URL, and I need to get the base URL.

The string I have is http://www.test.com/test-page/category.html

I am looking for a RegEx that will effectively remove any page/folder names at the end. The issue is that some people may enter the domain in the following formats:

http://www.test.com
www.test.co.uk/
www.test.info/test-page.html
www.test.gov/test-folder/test-page.html

It must return http://www.websitename.ext/ each time, i.e. the domain name and extension (e.g. .info, .com, .co.uk) with a forward slash at the end.

Effectively it needs to return the base URL, without any page/folder names. Is there an easy way to do this with a Regular Expression?

Thanks.

Upvotes: 0

Views: 3400

Answers (2)

DavidRR

Reputation: 19457

My approach: Use a RegEx to extract the domain name. Then add http:// to the front and / to the end. Here's the RegEx:

^(?:http:\/\/)?([\w_]+(?:\.[\w_]+)+)(?=(?:\/|$))

Also see this answer to the question Extract root domain name from string. (It left me somewhat dissatisfied, although it pointed out the need to account for https, the port number, and user authentication info, none of which my RegEx handles.)
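For readers who do need those cases, here is a minimal sketch of how the pattern might be widened. It is an assumption, not something taken from the linked answer: the names BASE_URL_PATTERN and GetBaseUrl are hypothetical, and the pattern simply accepts http or https, skips optional user info, allows hyphens in host labels, and ignores a port number. It always emits http://, as the question asks.

```vbscript
' Hypothetical widened pattern (an assumption, not the pattern from this answer):
'   optional http:// or https:// scheme,
'   optional user info ending in @,
'   host labels made of word characters or hyphens, separated by periods (captured),
'   optional :port,
'   then a lookahead for /, ?, # or the end of the string.
Const BASE_URL_PATTERN = "^(?:https?:\/\/)?(?:[^@\/]*@)?([\w-]+(?:\.[\w-]+)+)(?::\d+)?(?=[\/?#]|$)"

' Returns "http://host/" for a recognised URL, or "" when nothing matches.
Function GetBaseUrl(sUrl)
    Dim oRegex, oMatches

    Set oRegex = New RegExp
    oRegex.Pattern = BASE_URL_PATTERN
    oRegex.IgnoreCase = True
    oRegex.Global = False
    Set oMatches = oRegex.Execute(Trim(sUrl))

    If oMatches.Count > 0 Then
        ' SubMatches(0) is the captured host name without scheme, user info or port.
        GetBaseUrl = "http://" & oMatches(0).SubMatches(0) & "/"
    Else
        GetBaseUrl = ""
    End If
End Function
```

An empty return value signals that the input was not recognised, so the caller can decide whether to reject it or fall back to the raw string.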

Here is an implementation in VBScript. I put the RegEx in a constant and defined a function named GetDomainName(). You should be able to incorporate that function in your ASP page like this:

normalizedUrl = "http://" & GetDomainName(url) & "/"

You can also test my script from the command prompt by saving the code to a file named test.vbs and then passing it to cscript:

cscript test.vbs

Test Program

Option Explicit

Const REGEXPR = "^(?:http:\/\/)?([\w_]+(?:\.[\w_]+)+)(?=(?:\/|$))"
'                    ^^^^^^^^^   ^^^^^^   ^^^^^^^^^^       ^^^^
'                        A         B1         B2            C
'
' A  - An optional 'http://' scheme
' B1 - Followed by one or more word characters (letters, digits, underscore)
' B2 - Followed by one or more occurrences of a string
'      that begins with a period that is followed by
'      one or more word characters, and
' C  - Terminated by a slash or the end of the string.

Function GetDomainName(sUrl)
   Dim oRegex, oMatch, oMatches, oSubMatch

   Set oRegex = New RegExp
   oRegex.Pattern = REGEXPR
   oRegex.IgnoreCase = True
   oRegex.Global = False
   Set oMatches = oRegex.Execute(sUrl)

   If oMatches.Count > 0 Then
       GetDomainName = oMatches(0).SubMatches(0)
   Else
       GetDomainName = ""
   End If
End Function

Dim Data : Data = _
    Array( _
            "xhttp://www.test.com" _
          , "http://www..test.com" _
          , "http://www.test.com." _
          , "http://www.test.com" _
          , "www.test.co.uk/" _
          , "www.test.co.uk/?q=42" _
          , "www.test.info/test-page.html" _
          , "www.test.gov/test-folder/test-page.html" _
          , ".www.test.co.uk/" _
          )

Dim sUrl, sDomainName
For Each sUrl In Data
    sDomainName = GetDomainName(sUrl)

    If sDomainName = "" Then
        WScript.Echo "[ ] [" & sUrl & "]"
    Else
        WScript.Echo "[*] [" & sUrl & "] => [" & sDomainName & "]"
    End If
Next

Expected Output:

[ ] [xhttp://www.test.com]
[ ] [http://www..test.com]
[ ] [http://www.test.com.]
[*] [http://www.test.com] => [www.test.com]
[*] [www.test.co.uk/] => [www.test.co.uk]
[*] [www.test.co.uk/?q=42] => [www.test.co.uk]
[*] [www.test.info/test-page.html] => [www.test.info]
[*] [www.test.gov/test-folder/test-page.html] => [www.test.gov]
[ ] [.www.test.co.uk/]

Upvotes: 2

Jonas Elfström

Reputation: 31438

I haven't coded Classic ASP in 12 years and this is totally untested.

result = "http://" & Split(Replace(url, "http://",""),"/")(0) & "/"
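Since the one-liner is untested, here is a slightly more defensive sketch of the same Split idea. The function name BaseUrlBySplit is hypothetical. It strips only a leading http:// (case-insensitively) and guards against empty input, because Split on an empty string returns an empty array and indexing element 0 would raise an error. Unlike a regex, it does not validate the host name.

```vbscript
' Hypothetical helper name; same Split idea as the one-liner above.
Function BaseUrlBySplit(url)
    Dim s, parts

    BaseUrlBySplit = ""
    s = Trim(url)

    ' Strip a leading http:// only (case-insensitive), not one found mid-string.
    If LCase(Left(s, 7)) = "http://" Then s = Mid(s, 8)

    ' Everything before the first slash is treated as the host name.
    parts = Split(s, "/")

    ' Split("") yields an empty array, so check the bounds before indexing.
    If UBound(parts) >= 0 Then
        If parts(0) <> "" Then BaseUrlBySplit = "http://" & parts(0) & "/"
    End If
End Function
```

For example, BaseUrlBySplit("www.test.gov/test-folder/test-page.html") yields "http://www.test.gov/", while an empty string yields "".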

Upvotes: 0
