Reputation: 381
I'm trying to use a regex in VBScript to replace the contents of an HTML tag that has the class 'candidate' with the text 'PLACEHOLDER'. However, it doesn't always work.
<[^\>]*class=""[^\>]*candidate[^\>]*""[^\>]*>([\s\S]*?)</[^\>]*>
Flags: IgnoreCase = True, Multiline = True, Global = True
The first issue is that I don't know in advance which HTML tags will carry this class (e.g. it might be a <div> tag or a <p> tag). Secondly, the regex doesn't cope well with nested HTML tags.
Subject HTML:
<div class="outer">
<div class="normal">
<p><strong><em>Test</em></strong></p>
</div>
<div class="candidate">
<p>Test 1:</p>
<ul>
<li>Test 2</li>
<li>Test 3 </li>
<li>Test 4 </li>
</ul>
<p>Test 5</p>
</div>
<p>Test 6</p>
<div class="normal">
<p><strong>Test 7</strong></p>
</div>
</div>
Expected:
<div class="outer">
<div class="normal">
<p><strong><em>Test</em></strong></p>
</div>
<div class="candidate">
PLACEHOLDER
</div>
<p>Test 6</p>
<div class="normal">
<p><strong>Test 7</strong></p>
</div>
</div>
Actual:
<div class="outer">
<div class="normal">
<p><strong><em>Test</em></strong></p>
</div>
<div class="candidate">
PLACEHOLDER
<li>Test 2</li>
<li>Test 3 </li>
<li>Test 4 </li>
</ul>
<p>Test 5</p>
</div>
<p>Test 6</p>
<div class="normal">
<p><strong>Test 7</strong></p>
</div>
</div>
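The truncation can be seen in isolation. Below is a minimal sketch, assuming a shortened copy of the candidate block held in a string (the variable names subject and matches are illustrative only); it prints the text the pattern actually matches rather than performing the replacement.

```vbscript
Option Explicit

' Shortened, illustrative copy of the candidate block (not the full subject HTML)
Dim subject
subject = "<div class=""candidate""><p>Test 1:</p><ul><li>Test 2</li></ul></div>"

Dim re
Set re = New RegExp
re.Pattern = "<[^\>]*class=""[^\>]*candidate[^\>]*""[^\>]*>([\s\S]*?)</[^\>]*>"
re.IgnoreCase = True
re.Multiline = True
re.Global = True

Dim matches
Set matches = re.Execute(subject)

' The lazy group ends at the first "</...>" it meets, so the match
' stops at </p> instead of running to the closing </div>
WScript.Echo matches.Item(0).Value
```

The closing part of the pattern, </[^\>]*>, accepts any closing tag, so nothing ties it to the tag that opened the match.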
The candidate element may also contain nested tags of the same type but with different classes, and this case currently works only sporadically.
e.g:
<div class="candidate">Test<div class="normal">Test</div></div>
Any help would be greatly appreciated.
Upvotes: 2
Views: 1378
Reputation: 8459
Does it have to be a regular expression? The task is really easy using MSHTML (or any other HTML parser). In this example, I put your subject HTML in a file called "test.htm":
Option Explicit
Const ForReading = 1
' Read the subject HTML from disk
Dim fso
Set fso = CreateObject("Scripting.FileSystemObject")
Dim inFile
Set inFile = fso.OpenTextFile("test.htm", ForReading)
' Let MSHTML parse the markup into a DOM
Dim html
Set html = CreateObject("htmlfile")
html.write inFile.ReadAll()
inFile.Close
' Check every element, whatever its tag name
Dim allElements
Set allElements = html.getElementsByTagName("*")
Dim el
For Each el In allElements
    If HasClass(el, "candidate") Then
        ' Replaces all child nodes with a single text node
        el.innerText = "PLACEHOLDER"
    End If
Next
WScript.Echo html.body.outerHtml
' Takes into account the fact that the HTML "class" attribute can
' contain multiple whitespace-delimited classes
Function HasClass(el, className)
    Dim re
    Set re = New RegExp
    re.Pattern = "\b" & className & "\b"
    HasClass = re.Test(el.className)
End Function
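The nested case from the question can be tried the same way. This is only a sketch under a few assumptions: the HTML is written from a string rather than read from "test.htm", and the loop walks the collection backwards by index (my own precaution, since the collection returned by getElementsByTagName is, as far as I know, live and setting innerText removes descendants from it). The variable i is introduced here for illustration.

```vbscript
Option Explicit

Dim html
Set html = CreateObject("htmlfile")
' Nested example from the question, written from a string instead of a file
html.write "<div class=""candidate"">Test<div class=""normal"">Test</div></div>"
html.close

Dim allElements, i, el
Set allElements = html.getElementsByTagName("*")

' Walk backwards: descendants come after their parent in document order,
' so removing them cannot shift the elements still to be visited
For i = allElements.length - 1 To 0 Step -1
    Set el = allElements.item(i)
    If HasClass(el, "candidate") Then
        el.innerText = "PLACEHOLDER"
    End If
Next

WScript.Echo html.body.innerHTML

' Same helper as above: matches className as a whole word in the class list
Function HasClass(el, className)
    Dim re
    Set re = New RegExp
    re.Pattern = "\b" & className & "\b"
    HasClass = re.Test(el.className)
End Function
```

Because the DOM already knows where each element ends, the inner div with class "normal" is replaced along with the rest of the candidate's contents, and the tag type never has to be named. MSHTML may normalise the markup it echoes back (tag case, attribute quoting), so the output should be compared by structure rather than character by character.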
Upvotes: 3