Reputation: 186
I'm still trying to develop a function that extracts all headings (h1, h2, h3, ...) that have an id from an HTML text, in order to build a table of contents.
I've made a simple script using regex, but for some strange reason it collects only one match (the last one).
Here is my sample code:
Function RegExResults(strTarget, strPattern)
dim regEx
Set regEx = New RegExp
regEx.Pattern = strPattern
regEx.Global = True
regEx.IgnoreCase = True
regEx.Multiline = True
Set RegExResults = regEx.Execute(strTarget)
Set regEx = Nothing
End Function
htmlstr = "<h1>Documentation</h1><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.</p><h3 id=""one"">How do you smurf a murf?</h3><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet, ante. Donec eu libero sit amet quam egestas semper.</p><h3 id=""two"">How do many licks does a giraffe?</h3><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.</p>"
regpattern = "<h([1-9]).*id=\""(.*)\"">(.*)</h[1-9]>"
set arrayresult = RegExResults(htmlstr,regpattern)
For each result in arrayresult
response.write "count: " & arrayresult.count & "<br><hr>"
response.write "0: " & result.Submatches(0) & "<br>"
response.write "1: " & result.Submatches(1) & "<br>"
response.write "2: " & result.Submatches(2) & "<br>"
Next
I need to extract all headings and, for each one, know its level (1..9) and its id value, which I use to jump to the right section (#ID_value).
I hope someone can help me find out why this is not working as intended.
Thank you
Upvotes: 0
Views: 294
Reputation: 16950
The .* quantifiers in the pattern are greedy, so the first one runs to the end of the string and backtracks only as far as the last id; you need lazy quantifiers to collect every possible match. Use .*? instead.
With some improvements, the pattern could be something like below.
regpattern = "<(h[1-9]).*?id=""(.*?)"">(.*?)</\1>"
' \1 means the same as the 1st group
' backslash (\) is redundant to escape double quotes, so removed it
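To see the difference between the two patterns, here is a minimal sketch in Python rather than VBScript, chosen only because it is easy to run; greedy and lazy quantifiers and the \1 backreference behave the same way in both engines for this case. The html string is a shortened stand-in for the sample in the question, and the variable names are illustrative.

```python
import re

# Shortened stand-in for the sample HTML from the question.
html = (
    '<h1>Documentation</h1><p>Intro.</p>'
    '<h3 id="one">How do you smurf a murf?</h3><p>Text.</p>'
    '<h3 id="two">How do many licks does a giraffe?</h3><p>Text.</p>'
)

# Original pattern: every .* is greedy, so the match starts at <h1>
# and stretches to the last id and the last closing heading tag.
greedy = r'<h([1-9]).*id="(.*)">(.*)</h[1-9]>'

# Suggested pattern: lazy quantifiers, and \1 forces the closing tag
# to be the same heading tag that was opened.
lazy = r'<(h[1-9]).*?id="(.*?)">(.*?)</\1>'

greedy_matches = re.findall(greedy, html, re.IGNORECASE)
lazy_matches = re.findall(lazy, html, re.IGNORECASE)

print(greedy_matches)  # one match: level "1" paired with the id "two"
print(lazy_matches)    # two matches, one per heading that has an id
```

Note that the greedy version does not only lose a match; it also reports the wrong heading level, because the match begins at the h1 that has no id.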
I'd strongly recommend having a look at Repetition with Star and Plus. It's a very useful article for understanding lazy and greedy repetition in regex.
Oh, I almost forgot: you can't parse HTML with regex, or at least you shouldn't.
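As a rough illustration of the parser-based alternative, here is a sketch using Python's standard html.parser module. It is not Classic ASP code, and the HeadingCollector class and the shortened html string are made up for this example; the idea is only that a parser tracks tags and attributes for you, so attribute order, extra attributes, or nested inline tags do not break the extraction.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects (level, id, text) for every h1..h6 element that has an id."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # [level, id, text parts] while inside a heading

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; HTML defines h1 through h6.
        if len(tag) == 2 and tag[0] == 'h' and tag[1] in '123456':
            heading_id = dict(attrs).get('id')
            if heading_id is not None:
                self._current = [int(tag[1]), heading_id, []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        if self._current is not None and tag == 'h%d' % self._current[0]:
            level, heading_id, parts = self._current
            self.headings.append((level, heading_id, ''.join(parts)))
            self._current = None


# Shortened stand-in for the sample HTML from the question.
html = (
    '<h1>Documentation</h1><p>Intro.</p>'
    '<h3 id="one">How do you smurf a murf?</h3><p>Text.</p>'
    '<h3 id="two">How do many licks does a giraffe?</h3><p>Text.</p>'
)

collector = HeadingCollector()
collector.feed(html)
collector.close()
print(collector.headings)
```

The h1 is skipped because it has no id, which matches what the question asks for.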
Upvotes: 1