Reputation: 868
So I am loading some remote content and need to use regex to isolate the content of some tags.
on error resume next ' lets the err.number check below catch a failed request
set xmlhttp = CreateObject("MSXML2.ServerXMLHTTP")
xmlhttp.open "GET", url, false
xmlhttp.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
xmlhttp.setRequestHeader "Accept-Language", "en-us"
xmlhttp.send "x=hello"
status = xmlhttp.status
if err.number <> 0 or status <> 200 then
    if status = 404 then
        Response.Write "[EFERROR]Page does not exist (404)."
    elseif status = 401 then
        Response.Write "[EFERROR]Access denied (401)."
    elseif status >= 500 and status < 600 then
        Response.Write "[EFERROR]500 Internal Server Error on remote site."
    else
        Response.Write "[EFERROR]Server is down or does not exist."
    end if
else
    data = xmlhttp.responseText
end if
I basically need to get the content of the <title>Here is the title</title> tag, plus the meta description, keywords and some selected Open Graph metadata.
Finally, I need the content of the first <h1>Heading</h1> and the first <p>Paragraph</p>.
How can I parse the HTML data to get these things? Should I use regex?
Upvotes: 0
Views: 3493
Reputation: 1835
You may be able to use the .responseXML property to retrieve the content you want without using regex. Because you are looking for data inside <title>, <h1> and <p> tags, the document returned is probably HTML. If that HTML is well-formed according to the XML specification, it may already be parsed and accessible through .responseXML as soon as you get the response.
So you could try this:
Dim objData
Set objData = xmlhttp.responseXML.selectSingleNode("//*[local-name() = 'title']")
If objData Is Nothing Then
Response.Write "# no result #<br />"
Else
Response.Write "title: " & objData.Text & "<br />"
End If
Note, though, that this XPath expression may not be the most efficient way to query an XML document if you need to process large amounts of data.
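The approach above only works when the markup can be parsed as XML, which real-world HTML often cannot. Here is a minimal sketch of a guarded lookup, assuming MSXML 6.0 is installed on the server. GetTitleViaXml is a hypothetical helper name, and data is the responseText string from the question:

```vbscript
' Hypothetical helper: returns the text of the first <title> element,
' or "" when the markup cannot be parsed as XML or has no title.
Function GetTitleViaXml(markup)
    Dim doc, node
    GetTitleViaXml = ""
    Set doc = CreateObject("MSXML2.DOMDocument.6.0")
    doc.async = False
    ' loadXML returns False when the string is not well-formed XML
    If doc.loadXML(markup) Then
        Set node = doc.selectSingleNode("//*[local-name() = 'title']")
        If Not node Is Nothing Then GetTitleViaXml = node.text
    End If
    Set doc = Nothing
End Function

title = GetTitleViaXml(data)
If title = "" Then
    Response.Write "# no result #<br />"
End If
```

An empty result here is the signal to fall back to a string-based or regex-based approach.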
Upvotes: 1
Reputation: 868
I actually used this solution in the end, as it also solves the problem of tags having class names (or other attributes) in the code.
' Returns the first captured group of the first match, or "" if nothing matches.
Function GetFirstMatch(PatternToMatch, StringToSearch)
    Dim regEx, CurrentMatch, CurrentMatches
    Set regEx = New RegExp
    regEx.Pattern = PatternToMatch
    regEx.IgnoreCase = True
    regEx.Global = False ' only the first match is needed
    regEx.MultiLine = True
    Set CurrentMatches = regEx.Execute(StringToSearch)
    GetFirstMatch = ""
    If CurrentMatches.Count >= 1 Then
        Set CurrentMatch = CurrentMatches(0)
        If CurrentMatch.SubMatches.Count >= 1 Then
            GetFirstMatch = CurrentMatch.SubMatches(0)
        End If
    End If
    Set regEx = Nothing
End Function
title = clean_str(GetFirstMatch("<title[^>]*>([^<]+)</title>",data))
firstpara = clean_str(GetFirstMatch("<p[^>]*>([^<]+)</p>",data))
firsth1 = clean_str(GetFirstMatch("<h1[^>]*>([^<]+)</h1>",data))
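The same GetFirstMatch function could probably also cover the meta description, keywords and Open Graph tags from the question. This is a rough sketch, not a tested solution. It assumes attributes are double-quoted and that name (or property) appears before content. The variable names are only illustrative, and og:title stands in for whichever Open Graph properties are wanted:

```vbscript
' Sketch only: assumes double-quoted attributes, with name/property before content.
description = GetFirstMatch("<meta[^>]*name=""description""[^>]*content=""([^""]*)""", data)
keywords    = GetFirstMatch("<meta[^>]*name=""keywords""[^>]*content=""([^""]*)""", data)
ogTitle     = GetFirstMatch("<meta[^>]*property=""og:title""[^>]*content=""([^""]*)""", data)

' Lazy match that also accepts nested tags (e.g. <b>) inside the paragraph.
' \b stops "<p" from matching longer tag names such as <pre>.
firstParaHtml = GetFirstMatch("<p\b[^>]*>([\s\S]*?)</p>", data)
```

The last pattern matters because ([^<]+) stops at the first nested tag, so a paragraph containing inline markup would be skipped in favour of a later one. The lazy version returns the inner HTML instead, which may then need its tags stripped.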
Upvotes: 1
Reputation: 13243
Use the Mid function combined with the InStr function. I built a function that uses Mid to extract the text wrapped by a tag, after finding the position of each tag with InStr:
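Function GetInnerData(Data, TagOpen, TagClose)
    Dim OpenPos, ClosePos, StartPos
    GetInnerData = ""
    OpenPos = InStr(1, Data, TagOpen, 1) ' 1 = vbTextCompare, case-insensitive
    If OpenPos = 0 Then Exit Function
    StartPos = OpenPos + Len(TagOpen)
    ' Search for the closing tag only after the opening tag
    ClosePos = InStr(StartPos, Data, TagClose, 1)
    If ClosePos > 0 Then GetInnerData = Trim(Mid(Data, StartPos, ClosePos - StartPos))
End Function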
When you run this function like this, it will return My Title:
<%=GetInnerData("any text <title>My Title</title> any text","<title>","</title>")%>
In your case, you would call it like this:
TitleData = GetInnerData(data,"<title>","</title>")
This gets the content of your <title> tag. Or:
H1Data = GetInnerData(data,"<h1>","</h1>")
This gets the content of your <h1> tag.
InStr returns the position of the first occurrence of the search string, so the function returns the content of the first matching tag, which is what you need.
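One limitation is that GetInnerData needs the exact opening tag, so <h1 class="big"> would not be found when searching for <h1>. A possible variant that tolerates attributes is sketched below. GetInnerDataByName is a hypothetical name, and the sketch assumes no ">" character appears inside an attribute value:

```vbscript
' Hypothetical variant of GetInnerData that tolerates attributes on the opening tag.
Function GetInnerDataByName(Data, TagName)
    Dim OpenPos, StartPos, ClosePos
    GetInnerDataByName = ""
    OpenPos = InStr(1, Data, "<" & TagName, vbTextCompare)
    If OpenPos = 0 Then Exit Function
    ' Skip to the end of the opening tag, attributes included
    StartPos = InStr(OpenPos, Data, ">")
    If StartPos = 0 Then Exit Function
    StartPos = StartPos + 1
    ClosePos = InStr(StartPos, Data, "</" & TagName & ">", vbTextCompare)
    If ClosePos > 0 Then GetInnerDataByName = Trim(Mid(Data, StartPos, ClosePos - StartPos))
End Function

' Returns "Heading"
H1Data = GetInnerDataByName("<h1 class=""big"">Heading</h1>", "h1")
```

Because it matches on the prefix "<" & TagName, a short name such as p would also match <pre> or <param>, so this is safest for distinctive tags like title and h1.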
Upvotes: 0