Reputation: 795
How can I strip all HTML tags from a page in Visual Basic (VB6 in my case) and get plain text? I was able to do this in PHP with HTML Purifier. Is there a function, class, or script in VB6 that does the same? I need to process pages larger than 5 MB, and PHP is not efficient enough for that.
So, again, how do I convert this:
<!DOCTYPE html>
<html>
<head>
<title>Title</title>
</head>
<body>
<p>Paragraph 1</p>
<div>Section</div>
Hello!
</body>
</html>
To, let's say, this:
Paragraph 1
Section
Hello!
I wanted to make an API system to do this, but found out that it's not going to be reliable.
P.S.: I am doing this because I am writing a crawler for my search engine, and I only have experience in VB and PHP.
Thanks in advance.
Upvotes: 1
Views: 7749
Reputation:
I know this thread is old, but I wrote this today. It isn't elegant, but it works great.
Public Function RemoveHTML(HTMLstring As String) As String
    If Not HTMLstring.Contains("<") Then Return HTMLstring
    Dim DoRec As Boolean = False
    Dim textOut As String = ""
    Dim SkipMe As Boolean = False
    Dim SkipMeTag As String = ""
    For l = 1 To HTMLstring.Length
        Dim tmp As String = Mid(HTMLstring, l, 1)
        ' Enable skip-me mode (for large blocks of non-readable code)
        If tmp = "<" And Mid(HTMLstring, l + 1, 6) = "script" Then SkipMe = True : SkipMeTag = "script" : DoRec = False
        If tmp = "<" And Mid(HTMLstring, l + 1, 5) = "style" Then SkipMe = True : SkipMeTag = "style" : DoRec = False
        ' If we're already in skip-me mode, figure out whether it's time to exit it.
        If SkipMe = True Then
            If tmp = "<" And Mid(HTMLstring, l + 1, Len(SkipMeTag) + 1) = "/" + SkipMeTag Then
                SkipMe = False
                tmp = ""
                l = l + Len(SkipMeTag) + 1
                DoRec = False
            End If
        End If
        ' If we aren't in skip-me mode, parse the HTML content (pull the text out from between tags)
        If SkipMe = False Then
            If tmp = ">" Then DoRec = True : textOut &= " " : tmp = ""
            If tmp = "<" Then DoRec = False : tmp = ""
            If DoRec = True Then
                textOut &= tmp
            End If
        End If
    Next
    Return textOut
End Function
Upvotes: 2
Reputation: 791
@Matth3w's code is great, but it is not compatible with VB6 (Visual Basic 6).
I have ported his code to VB6 and added some useful extras.
1) If your HTML text contains Unicode (UTF-8) characters, add a reference to the Microsoft Forms 2.0 Object Library and use its TextBox control for the input and output.
2) Add two TextBoxes and one CommandButton.
3) Set the TextBox properties: MultiLine = True, Font = Tahoma (or anything other than MS Sans Serif), ScrollBars = 3.
4) Paste the following code into the code window:
Private Sub Command1_Click()
    TextBox2.Text = RemoveHTML(TextBox1.Text)
End Sub
Public Function RemoveHTML(HTMLstring As String) As String
    Dim DoRec As Boolean
    Dim textOut As String
    Dim SkipMe As Boolean
    Dim SkipMeTag As String
    Dim tmp As String
    Dim l As Long
    ' Note: LCase lowercases the whole input, so the returned text is lowercase too.
    HTMLstring = Replace(LCase(HTMLstring), "</p>", vbCrLf)
    HTMLstring = Replace(LCase(HTMLstring), "<br>", vbCrLf)
    HTMLstring = Replace(LCase(HTMLstring), "<br/>", vbCrLf)
    HTMLstring = Replace(LCase(HTMLstring), "&zwnj;", " ")
    HTMLstring = Replace(LCase(HTMLstring), "&nbsp;", " ")
    HTMLstring = Replace(LCase(HTMLstring), "&sect;", "-")
    HTMLstring = Replace(LCase(HTMLstring), "&ndash;", "-")
    HTMLstring = Replace(LCase(HTMLstring), "&mdash;", "-")
    HTMLstring = Replace(LCase(HTMLstring), "&rlm;", "")
    HTMLstring = Replace(LCase(HTMLstring), "&ldquo;", ChrW(34))
    HTMLstring = Replace(LCase(HTMLstring), "&rdquo;", ChrW(34))
    HTMLstring = Replace(LCase(HTMLstring), "&lsquo;", ChrW(34))
    HTMLstring = Replace(LCase(HTMLstring), "&rsquo;", ChrW(34))
    HTMLstring = Replace(LCase(HTMLstring), "&laquo;", ChrW(34))
    HTMLstring = Replace(LCase(HTMLstring), "&raquo;", ChrW(34))
    For l = 1 To Len(HTMLstring)
        tmp = Mid(HTMLstring, l, 1)
        ' Enable skip-me mode (for large blocks of non-readable code)
        If tmp = "<" And Mid(HTMLstring, l + 1, 6) = "script" Then SkipMe = True: SkipMeTag = "script": DoRec = False
        If tmp = "<" And Mid(HTMLstring, l + 1, 5) = "style" Then SkipMe = True: SkipMeTag = "style": DoRec = False
        ' If we're already in skip-me mode, figure out whether it's time to exit it.
        If SkipMe = True Then
            If tmp = "<" And Mid(HTMLstring, l + 1, Len(SkipMeTag) + 1) = "/" + SkipMeTag Then
                SkipMe = False
                tmp = ""
                l = l + Len(SkipMeTag) + 1
                DoRec = False
            End If
        End If
        ' If we aren't in skip-me mode, parse the HTML content (pull the text out from between tags)
        If SkipMe = False Then
            If tmp = ">" Then DoRec = True: textOut = textOut & " ": tmp = ""
            If tmp = "<" Then DoRec = False: tmp = ""
            If DoRec = True Then
                textOut = textOut & tmp
            End If
        End If
    Next
    RemoveHTML = textOut
End Function
To support Persian, you can add this line, which replaces the Arabic ي (U+064A) with the Persian ی (U+06CC):
HTMLstring = Replace(LCase(HTMLstring), ChrW(1610), ChrW(1740))
Update: There is an important bug in this function. If it doesn't find any HTML tag in your variable, it returns an empty string! To be safe, guard the call with a condition like this:
If Len(RemoveHTML(variable)) > 0 Then variable = RemoveHTML(variable)
Upvotes: 2
Reputation: 13267
Considering how flawed most HTML you find can be, I find it much easier to use a technique like the one described in "HTML Parsing? Tidy it up first."
The cleaned up HTML is then suitable for parsing using any of several techniques, from loading it into an XML DOM, to using a SAX parser, to hand-coded parsing, to regular expressions (if you insist on making your life and the lives of any maintainers who come after you difficult).
If your documents are of reasonably small size the DOM is the easy way to go. After loading the cleaned HTML as XML you can simply walk the node tree extracting any non-empty text properties. It is easy to use an exclusion list of nodeName or baseName values for tags to be ignored.
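The node walk described above could look roughly like the following in VB6. This is only a sketch under several assumptions: the input has already been tidied into well-formed XML, MSXML 6.0 is installed, and named entities such as &nbsp; have been converted to numeric ones (without a DTD they will not parse). The names ExtractText and CollectText are made up for illustration, and the exclusion list (script, style, head) is just one possible choice.

```vb
' Sketch: pull text out of already-tidied, well-formed XHTML with MSXML (late bound).
' ExtractText and CollectText are hypothetical names, not part of any library.
Public Function ExtractText(ByVal xhtml As String) As String
    Dim doc As Object
    Dim result As String
    Set doc = CreateObject("MSXML2.DOMDocument.6.0")    ' assumes MSXML 6.0 is installed
    doc.async = False
    doc.validateOnParse = False
    doc.resolveExternals = False                         ' never fetch an external DTD
    doc.setProperty "ProhibitDTD", False                 ' tolerate a DOCTYPE line
    If Not doc.loadXML(xhtml) Then
        Err.Raise vbObjectError + 1, , "Not well-formed: " & doc.parseError.reason
    End If
    CollectText doc.documentElement, result
    ExtractText = result
End Function

Private Sub CollectText(ByVal node As Object, ByRef result As String)
    Dim child As Object
    Select Case node.nodeType
        Case 3, 4   ' NODE_TEXT, NODE_CDATA_SECTION
            ' Trim$ only strips spaces; keep the node if anything else remains.
            If Len(Trim$(node.nodeValue)) > 0 Then
                result = result & Trim$(node.nodeValue) & vbCrLf
            End If
        Case 1      ' NODE_ELEMENT
            Select Case LCase$(node.baseName)
                Case "script", "style", "head"
                    ' Exclusion list: skip these elements and everything inside them.
                Case Else
                    For Each child In node.childNodes
                        CollectText child, result
                    Next
            End Select
    End Select
End Sub
```

Called as ExtractText(tidiedHtml), this returns one line per non-empty text node. Loading a whole document into a DOM costs memory, so for multi-megabyte pages a SAX-style pass may be the better fit.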
Upvotes: 1
Reputation: 49
I have a snippet for C#, but you can port it to VB very easily :)
/// <summary>
/// Remove HTML from string with Regex.
/// </summary>
public static string StripTagsRegex(string source)
{
    return Regex.Replace(source, "<.*?>", string.Empty);
}
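For VB6, one possible port uses the VBScript regular-expression engine through late binding. This is a sketch, not a drop-in equivalent: it assumes the VBScript.RegExp component is available on the machine, and it swaps the pattern for "<[^>]*>" because "." does not match line breaks in that engine, so tags spanning several lines would otherwise be missed.

```vb
' Sketch of a VB6 port; the function name mirrors the C# snippet above.
Public Function StripTagsRegex(ByVal source As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")    ' late bound, no project reference needed
    re.Global = True                            ' replace every match, not only the first
    re.Pattern = "<[^>]*>"                      ' negated class also matches across line breaks
    StripTagsRegex = re.Replace(source, "")
End Function
```

Like the C# version, this removes only the tags themselves: the contents of script and style blocks are left in the output, and HTML entities are not decoded.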
Upvotes: 1