Vlad

Reputation: 795

How to strip all HTML tags and entities and get CLEAR text in Visual Basic?

How can I strip all HTML tags and get plain text in Visual Basic (version 6 in my case)? I was able to do this in PHP with HTML Purifier, but I need to process pages larger than 5 MB and PHP is not efficient enough for that. Is there a function, class, or script in VB6 that lets me do this?

So, again, how do I convert this:

<!DOCTYPE html>
<html>
<head>
<title>Title</title>
</head>
<body>
<p>Paragraph 1</p>
<div>Section</div>
Hello!
</body>
</html>

To, let's say, this:

Paragraph 1
Section
Hello!

I wanted to make an API system to do this, but found out that it's not going to be reliable.

P.S.: I am doing this as I am making a crawler for my search engine, and I only have experience in VB and PHP.

Thanks in advance.

Upvotes: 1

Views: 7749

Answers (4)

user4914655

I know this thread is old, but I wrote this today. It isn't elegant, but it works great.

    Public Function RemoveHTML(HTMLstring As String) As String

        If Not HTMLstring.Contains("<") Then Return HTMLstring

        ' Start in recording mode so that text before the first tag is kept
        Dim DoRec As Boolean = True
        Dim textOut As String = ""

        Dim SkipMe As Boolean = False
        Dim SkipMeTag As String = ""

        For l As Integer = 1 To HTMLstring.Length
            Dim tmp As String = Mid(HTMLstring, l, 1)

            ' Enable skip-me mode (for large blocks of non-readable code)
            If tmp = "<" And Mid(HTMLstring, l + 1, 6) = "script" Then SkipMe = True : SkipMeTag = "script" : DoRec = False
            If tmp = "<" And Mid(HTMLstring, l + 1, 5) = "style" Then SkipMe = True : SkipMeTag = "style" : DoRec = False

            ' If we're already in skip-me mode, then figure out if it's time to exit it.
            If SkipMe = True Then
                If tmp = "<" And Mid(HTMLstring, l + 1, Len(SkipMeTag) + 1) = "/" + SkipMeTag Then
                    SkipMe = False
                    tmp = ""
                    l = l + Len(SkipMeTag) + 1
                    DoRec = False
                End If
            End If

            ' If we aren't in skip-me mode, move on to parsing the HTML content (pulling text out from in between tags)
            If SkipMe = False Then
                If tmp = ">" Then DoRec = True : textOut &= " " : tmp = ""
                If tmp = "<" Then DoRec = False : tmp = ""

                If DoRec = True Then
                    textOut &= tmp
                End If
            End If

        Next

        Return textOut
    End Function

Upvotes: 2

Mahdi Jazini

Reputation: 791

@Matth3w's code was great, but it is not compatible with VB6 (Visual Basic 6).

I have downgraded his code to VB6 and added some useful extras.

1) If your HTML text contains Unicode (UTF-8) characters, add the Microsoft Forms 2.0 Object Library and use its TextBox for the input and output.

2) Add 2 textboxes and 1 command button.

3) Set the textbox properties: MultiLine = True, ScrollBars = 3, and a font such as Tahoma (anything other than MS Sans Serif).

4) Paste the following code into the code area:

Private Sub Command1_Click()
    TextBox2.Text = RemoveHTML(TextBox1.Text)
End Sub

' ByVal keeps the caller's string unchanged (VB6 passes ByRef by default)
Public Function RemoveHTML(ByVal HTMLstring As String) As String

    Dim DoRec As Boolean
    Dim textOut As String

    Dim SkipMe As Boolean
    Dim SkipMeTag As String

    Dim tmp As String
    Dim l As Long

    ' Lowercase once so that tag and entity matching is case-insensitive.
    ' Note that this also lowercases the text that is returned.
    HTMLstring = LCase(HTMLstring)

    HTMLstring = Replace(HTMLstring, "</p>", vbCrLf)
    HTMLstring = Replace(HTMLstring, "<br>", vbCrLf)
    HTMLstring = Replace(HTMLstring, "<br/>", vbCrLf)
    HTMLstring = Replace(HTMLstring, "&zwnj;", " ")
    HTMLstring = Replace(HTMLstring, "&nbsp;", " ")

    HTMLstring = Replace(HTMLstring, "&sect;", "-")
    HTMLstring = Replace(HTMLstring, "&ndash;", "-")
    HTMLstring = Replace(HTMLstring, "&mdash;", "-")
    HTMLstring = Replace(HTMLstring, "&rlm;", "")
    HTMLstring = Replace(HTMLstring, "&ldquo;", ChrW(34))
    HTMLstring = Replace(HTMLstring, "&rdquo;", ChrW(34))
    HTMLstring = Replace(HTMLstring, "&lsquo;", ChrW(34))
    HTMLstring = Replace(HTMLstring, "&rsquo;", ChrW(34))

    HTMLstring = Replace(HTMLstring, "&laquo;", ChrW(34))
    HTMLstring = Replace(HTMLstring, "&raquo;", ChrW(34))

    For l = 1 To Len(HTMLstring)
        tmp = Mid(HTMLstring, l, 1)

        ' Enable skip-me mode (for large blocks of non-readable code)
        If tmp = "<" And Mid(HTMLstring, l + 1, 6) = "script" Then SkipMe = True: SkipMeTag = "script": DoRec = False
        If tmp = "<" And Mid(HTMLstring, l + 1, 5) = "style" Then SkipMe = True: SkipMeTag = "style": DoRec = False

        ' If we're already in skip-me mode, then figure out if it's time to exit it.
        If SkipMe = True Then
            If tmp = "<" And Mid(HTMLstring, l + 1, Len(SkipMeTag) + 1) = "/" + SkipMeTag Then
                SkipMe = False
                tmp = ""
                l = l + Len(SkipMeTag) + 1
                DoRec = False
            End If
        End If

        ' If we aren't in skip-me mode, move on to parsing the HTML content (pulling text out from in between tags)
        If SkipMe = False Then
            If tmp = ">" Then DoRec = True: textOut = textOut & " ": tmp = ""
            If tmp = "<" Then DoRec = False: tmp = ""

            If DoRec = True Then
                textOut = textOut & tmp
            End If
        End If

    Next

    RemoveHTML = textOut
End Function

To support the Persian language, you can replace the old Farsi (Arabic) ي (U+064A) with the new one ی (U+06CC) by adding this line:

HTMLstring = Replace(LCase(HTMLstring), ChrW(1610), ChrW(1740))

Update: There is an important bug in this function. If it doesn't find any HTML tag in your variable, it returns an empty string! To be safe, guard the call with a condition like this:

If Len(RemoveHTML(variable)) > 0 Then variable = RemoveHTML(variable)
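
The guard above runs the whole function twice, which adds up on multi-megabyte pages. A cheaper check, sketched here under the assumption that "no tag" simply means "no < character", is to test the input before calling RemoveHTML at all. The helper name MayContainTags is made up for this illustration.

```vb
Option Explicit

' Hypothetical helper (not part of the answer's code): returns True when the
' input contains at least one "<" and so may hold a tag worth stripping.
Public Function MayContainTags(ByVal source As String) As Boolean
    MayContainTags = (InStr(1, source, "<", vbBinaryCompare) > 0)
End Function
```

It would be used as: If MayContainTags(variable) Then variable = RemoveHTML(variable). Input without any "<" is then left untouched instead of being emptied.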

Upvotes: 2

Bob77

Reputation: 13267

Considering how flawed most HTML you find can be, I find it much easier to use a technique like the one described in "HTML Parsing? Tidy it up first."

The cleaned up HTML is then suitable for parsing using any of several techniques, from loading it into an XML DOM, to using a SAX parser, to hand-coded parsing, to regular expressions (if you insist on making your life and the lives of any maintainers who come after you difficult).

If your documents are of reasonably small size the DOM is the easy way to go. After loading the cleaned HTML as XML you can simply walk the node tree extracting any non-empty text properties. It is easy to use an exclusion list of nodeName or baseName values for tags to be ignored.
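
As a rough VB6 sketch of the DOM approach described above: it assumes MSXML 6.0 is installed, that the input has already been tidied into well-formed XML/XHTML, and that script, style, and head are the tags to ignore. The names TextFromCleanHtml and CollectText are invented for this example, and late binding is used only to avoid needing a project reference.

```vb
Option Explicit

' Appends the text of every text node under "node" to textOut,
' skipping elements on the exclusion list.
Private Sub CollectText(ByVal node As Object, ByRef textOut As String)
    Const NODE_ELEMENT As Long = 1
    Const NODE_TEXT As Long = 3
    Const NODE_CDATA_SECTION As Long = 4
    Dim child As Object
    Dim s As String

    Select Case node.nodeType
        Case NODE_TEXT, NODE_CDATA_SECTION
            ' Turn line breaks and tabs into spaces, then trim.
            s = Replace(Replace(Replace(node.nodeValue, vbCr, " "), vbLf, " "), vbTab, " ")
            s = Trim$(s)
            If Len(s) > 0 Then textOut = textOut & s & vbCrLf
            Exit Sub
        Case NODE_ELEMENT
            ' Exclusion list (an assumption; adjust to taste).
            Select Case LCase$(node.baseName)
                Case "script", "style", "head"
                    Exit Sub
            End Select
    End Select

    For Each child In node.childNodes
        CollectText child, textOut
    Next
End Sub

' Loads already-cleaned HTML as XML and returns its readable text.
Public Function TextFromCleanHtml(ByVal cleanHtml As String) As String
    Dim doc As Object
    Dim textOut As String

    Set doc = CreateObject("MSXML2.DOMDocument.6.0")
    doc.async = False
    doc.validateOnParse = False
    doc.resolveExternals = False
    ' MSXML 6.0 rejects a DOCTYPE by default; allow it without resolving it.
    doc.setProperty "ProhibitDTD", False

    If Not doc.loadXML(cleanHtml) Then
        Err.Raise vbObjectError + 1, "TextFromCleanHtml", doc.parseError.reason
    End If

    CollectText doc.documentElement, textOut
    TextFromCleanHtml = textOut
End Function
```

Each text node ends up on its own line. Named entities other than the five built into XML (such as &nbsp;) would make loadXML fail, so the tidy step would need to convert them first.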

Upvotes: 1

Jimmi

Reputation: 49

I have a snippet for C#, but you can port it to VB very easily :)

/// <summary>
/// Remove HTML from string with Regex.
/// </summary>
public static string StripTagsRegex(string source)
{
    return Regex.Replace(source, "<.*?>", string.Empty);
}
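
A possible VB6 port of the snippet above, as a sketch only: it assumes the VBScript.RegExp COM object is available (it ships with Windows Script) and uses late binding so no project reference is needed. The pattern is changed to <[^>]*> because the dot in VBScript regular expressions does not match line breaks, so <.*?> would miss tags that span several lines.

```vb
Option Explicit

' Removes everything that looks like a tag. Like the C# original, it does not
' drop the contents of script/style blocks and does not decode entities.
Public Function StripTagsRegex(ByVal source As String) As String
    Dim re As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True            ' replace every match, not just the first
    re.Pattern = "<[^>]*>"

    StripTagsRegex = re.Replace(source, "")
End Function
```

For example, StripTagsRegex("<p>Paragraph 1</p>") returns "Paragraph 1".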

Upvotes: 1
