Reputation: 6122
I want to remove all div elements, including the ones with attributes like class, from my string.
I already checked here, so regex is apparently not the answer: RegEx match open tags except XHTML self-contained tags
I currently have a regex that strips all tags from a string and preserves the content (note that I'm never parsing a full HTML document, if that matters): Regex.Replace(s, "<[^>]*(>|$)", String.Empty). However, I want only the div tags removed, with their content preserved.
So I have:
<div class=""fade-content""><div><span>some content</span></div></div>
<div>some content</div>
Desired output:
<span>some content</span>
some content
I was still going down the regex path and tried something like <div>.*<\/div>, but that misses divs with attributes.
How can I remove div elements only, using VB.NET?
Upvotes: 0
Views: 379
Reputation: 4983
This can be achieved without regular expressions by using a WebBrowser control. Try the following:
ExtractDesiredData:
'at the top of the file
Imports System.Diagnostics
Imports System.Windows.Forms

Private Function ExtractDesiredData(html As String) As List(Of String)
    Dim result As List(Of String) = New List(Of String)()

    'create new instance
    Using wb As WebBrowser = New WebBrowser()
        wb.Navigate(New Uri("about:blank"))

        'create reference
        Dim doc As HtmlDocument = wb.Document

        'add html to document
        doc.Write(html)

        'loop through body elements
        For Each elem As HtmlElement In doc.Body.All
            'keep only innermost divs: compare the tag name case-insensitively and
            'look for nested divs in the DOM instead of searching InnerHtml for the text "DIV"
            If elem.TagName.Equals("DIV", StringComparison.OrdinalIgnoreCase) AndAlso elem.GetElementsByTagName("DIV").Count = 0 Then
                Debug.WriteLine($"DIV elem InnerHtml: '{elem.InnerHtml}'")

                'add
                result.Add(elem.InnerHtml)
            End If
        Next
    End Using

    Return result
End Function
Usage:
Dim html As String = "<div class=""fade-content""><div><span>some content</span></div></div>"
html &= vbCrLf & "<div>some content</div>"
Dim desiredData As List(Of String) = ExtractDesiredData(html)
Upvotes: 0
Reputation: 1173
There are several ways to do this. One short and simple option is:
Regex.Replace(s, "</?div.*?>", String.Empty)
Here is an example:
's stands in for your HTML string
Dim s As String = "<div class=""""fade-content""""><div><span>some content</span></div></div>" & Environment.NewLine & "<div>some content</div>"
'let's store the result in s1
Dim s1 As String = Text.RegularExpressions.Regex.Replace(s, "</?div.*?>", String.Empty)
'output
MessageBox.Show(s1)
Output:
<span>some content</span>
some content
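One caveat with the pattern above: "</?div.*?>" also matches any tag whose name merely starts with "div", and "." stops at line breaks, so a div tag whose attributes span several lines would be missed. Below is a slightly stricter sketch, not a definitive implementation. The module name DivStripper and the helper RemoveDivTags are made up for illustration. It is still a regex, so it will trip over a ">" inside an attribute value or inside comments.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module DivStripper
    ' Hypothetical helper (name chosen for this example): removes opening and
    ' closing div tags, with or without attributes, and keeps what is between them.
    Function RemoveDivTags(input As String) As String
        ' \b     : the tag name must end right after "div"
        ' [^>]*  : attributes up to the closing bracket, line breaks included
        Return Regex.Replace(input, "</?div\b[^>]*>", String.Empty, RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim s As String = "<div class=""fade-content""><div><span>some content</span></div></div>" &
                          Environment.NewLine & "<div>some content</div>"
        Console.WriteLine(RemoveDivTags(s))
    End Sub
End Module
```

For the sample input from the question, this should print the span line followed by the plain "some content" line, the same as the shorter pattern; the two only differ on edge cases such as multi-line tags or tag names that begin with "div".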
Upvotes: 3