Khronos

Reputation: 381

Replacing HTML with VBScript Regex

I'm trying to use a regular expression in VBScript to replace the contents of any HTML tag that has the class 'candidate' with the text 'PLACEHOLDER'. However, it doesn't always work.

<[^\>]*class=""[^\>]*candidate[^\>]*""[^\>]*>([\s\S]*?)</[^\>]*>

Flags: IgnoreCase = True, Multiline = True, Global = True

There are two issues. First, I don't know in advance which type of HTML tag will carry this class (e.g. it might be a <div> tag or a <p> tag). Second, the regex doesn't cope well with inner HTML tags.

Subject HTML:

<div class="outer">
<div class="normal">
<p><strong><em>Test</em></strong></p>
</div>
<div class="candidate">
<p>Test 1:</p>
<ul>
    <li>Test 2</li>
    <li>Test 3 </li>
    <li>Test 4 </li>
</ul>
<p>Test 5</p>
</div>
<p>Test 6</p>
<div class="normal">
<p><strong>Test 7</strong></p>
</div>
</div>

Expected:

<div class="outer">
<div class="normal">
<p><strong><em>Test</em></strong></p>
</div>
<div class="candidate">
PLACEHOLDER
</div>
<p>Test 6</p>
<div class="normal">
<p><strong>Test 7</strong></p>
</div>
</div>

Actual:

<div class="outer">
<div class="normal">
<p><strong><em>Test</em></strong></p>
</div>
<div class="candidate">
PLACEHOLDER
    <li>Test 2</li>
    <li>Test 3 </li>
    <li>Test 4 </li>
</ul>
<p>Test 5</p>
</div>
<p>Test 6</p>
<div class="normal">
<p><strong>Test 7</strong></p>
</div>
</div>

The candidate tag may also contain inner tags of the same type but with different classes; this case currently works only sporadically.

e.g.:

<div class="candidate">Test<div class="normal">Test</div></div>

Any help would be greatly appreciated.

Upvotes: 2

Views: 1378

Answers (1)

Cheran Shunmugavel

Reputation: 8459

Does it have to be a regular expression? The task is really easy using MSHTML (or any other HTML parser). In this example, I put your subject HTML in a file called "test.htm":

Option Explicit

Const ForReading = 1

Dim fso
Set fso = CreateObject("Scripting.FileSystemObject")
Dim inFile
Set inFile = fso.OpenTextFile("test.htm", ForReading)

Dim html
Set html = CreateObject("htmlfile")
html.write inFile.ReadAll()
inFile.Close

Dim allElements
Set allElements = html.getElementsByTagName("*")

Dim el
For Each el in allElements
    If (HasClass(el, "candidate")) Then
        el.innerText = "PLACEHOLDER"
    End If
Next

WScript.Echo html.body.outerHtml

' Takes into account the fact that the HTML "class" attribute can
' contain multiple whitespace-delimited classes. Anchoring on whitespace
' (rather than \b) avoids matching hyphenated names such as "candidate-list".
Function HasClass(el, className)
    Dim re
    Set re = New RegExp

    re.Pattern = "(^|\s)" & className & "(\s|$)"
    HasClass = re.Test(el.className)
End Function
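
As a rough illustration of why the regex struggles with nesting, the sketch below runs the pattern from the question against the nested example from the question (the string is inlined here rather than read from a file, which is my own simplification). The lazy group `([\s\S]*?)` stops at the first closing tag of any kind, so the match ends at the inner `</div>` and the outer `</div>` is left behind. A regular expression of this form has no way to count opening and closing tags, which is why a DOM-based approach tends to be more dependable here.

```vbscript
Option Explicit

' Pattern and flags copied from the question.
Dim re
Set re = New RegExp
re.IgnoreCase = True
re.Multiline = True
re.Global = True
re.Pattern = "<[^\>]*class=""[^\>]*candidate[^\>]*""[^\>]*>([\s\S]*?)</[^\>]*>"

' Nested example from the question, inlined for illustration.
Dim subject
subject = "<div class=""candidate"">Test<div class=""normal"">Test</div></div>"

' Print every match. The lazy quantifier ends the match at the FIRST
' closing tag it meets, which is the inner </div>, not the outer one.
Dim m
For Each m In re.Execute(subject)
    WScript.Echo m.Value
Next
' Prints: <div class="candidate">Test<div class="normal">Test</div>
```

Running it with `cscript` should show a single match that omits the final `</div>`, so a replacement based on this match would leave a stray closing tag in the output.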

Upvotes: 3
