Reputation: 13
I've currently got a string which contains a URL, and I need to get the base URL.
The string I have is http://www.test.com/test-page/category.html
I am looking for a RegEx that will effectively remove any page/folder names at the end. The issue is that some people may enter the domain in the following formats:
http://www.test.com
www.test.co.uk/
www.test.info/test-page.html
www.test.gov/test-folder/test-page.html
It must return http://www.websitename.ext/ each time i.e. the domain name and extension (e.g. .info .com .co.uk etc) with a forward slash at the end.
Effectively it needs to return the base URL, without any page/folder names. Is there any easy way to do with with a Regular Expression?
Thanks.
Upvotes: 0
Views: 3400
Reputation: 19457
My approach: Use a RegEx to extract the domain name. Then add http:
to the front and /
to the end. Here's the RegEx:
^(?:http:\/\/)?([\w_]+(?:\.[\w_]+)+)(?=(?:\/|$))
Also see this answer to the question Extract root domain name from string. (It left me somewhat disatisfied, although pointed out the need to account for https
, the port number, and user authentication info which my RegEx does not do.)
Here is an implementation in VBScript. I put the RegEx in a constant and defined a function named GetDomainName()
. You should be able to incorporate that function in your ASP page like this:
normalizedUrl = "http://" & GetDomainName(url) & "/"
You can also test my script from the command prompt by saving the code to a file named test.vbs
and then passing it to cscript
:
cscript test.vbs
Test Program
Option Explicit
Const REGEXPR = "^(?:http:\/\/)?([\w_]+(?:\.[\w_]+)+)(?=(?:\/|$))"
' ^^^^^^^^^ ^^^^^^ ^^^^^^^^^^ ^^^^
' A B1 B2 C
'
' A - An optional 'http://' scheme
' B1 - Followed by one or more alpha-numeric characters
' B2 - Followed optionally by one or more occurences of a string
' that begins with a period that is followed by
' one or more alphanumeric characters, and
' C - Terminated by a slash or nothing.
Function GetDomainName(sUrl)
Dim oRegex, oMatch, oMatches, oSubMatch
Set oRegex = New RegExp
oRegex.Pattern = REGEXPR
oRegex.IgnoreCase = True
oRegex.Global = False
Set oMatches = oRegex.Execute(sUrl)
If oMatches.Count > 0 Then
GetDomainName = oMatches(0).SubMatches(0)
Else
GetDomainName = ""
End If
End Function
Dim Data : Data = _
Array( _
"xhttp://www.test.com" _
, "http://www..test.com" _
, "http://www.test.com." _
, "http://www.test.com" _
, "www.test.co.uk/" _
, "www.test.co.uk/?q=42" _
, "www.test.info/test-page.html" _
, "www.test.gov/test-folder/test-page.html" _
, ".www.test.co.uk/" _
)
Dim sUrl, sDomainName
For Each sUrl In Data
sDomainName = GetDomainName(sUrl)
If sDomainName = "" Then
WScript.Echo "[ ] [" & sUrl & "]"
Else
WScript.Echo "[*] [" & sUrl & "] => [" & sDomainName & "]"
End If
Next
Expected Output:
[ ] [xhttp://www.test.com]
[ ] [http://www..test.com]
[ ] [http://www.test.com.]
[*] [http://www.test.com] => [www.test.com]
[*] [www.test.co.uk/] => [www.test.co.uk]
[*] [www.test.co.uk/?q=42] => [www.test.co.uk]
[*] [www.test.info/test-page.html] => [www.test.info]
[*] [www.test.gov/test-folder/test-page.html] => [www.test.gov]
[ ] [.www.test.co.uk/]
Upvotes: 2
Reputation: 31438
I haven't coded Classic ASP in 12 years and this is totally untested.
result = "http://" & Split(Replace(url, "http://",""),"/")(0) & "/"
Upvotes: 0