antonis lambrianides

Reputation: 423

Eliminate all content between <> tags

I have a very large block of text in HTML format, and I want to eliminate everything inside the <> tags. At some points, outside the tags, there is a name that I want to keep, like <a><a><a>name</a></a></a>. I wrote a small program where I paste in the long text and press a button. The code cuts off everything from the start up to the next >, and then, if the content from that > to the next < is longer than 1 character, it adds that content to a list.

Basically, I want to collect all the names that are not inside the <> tags and put them in a list. They are the contestants in a competition.

The thing is, the code doesn't work exactly as I hoped. It needs a small tweak and I'm a little confused.

Here is my code:

Private Sub btnPickWinner_Click(sender As Object, e As EventArgs) Handles btnPickWinner.Click
    Dim vContestant As String = ""
    Dim vAllText As String = tbxText.Text
    Dim vList As New List(Of String)
    Dim vLength As Integer
    Dim vStrToCountLength As String
    While vAllText.Length > 1
        vStrToCountLength = vAllText.Substring(0, vAllText.IndexOf(">"))
        vLength = vAllText.Length - vStrToCountLength.Length
        vAllText = vAllText.Substring(vAllText.IndexOf(">"), vLength)
        vContestant = vAllText.Substring(0, vAllText.IndexOf("<"))
        If (vContestant.Length > 1) Then
            vList.Add(vContestant)
        End If
    End While
End Sub

Here is a small sample of the text I have:

<div class="_6a"><div class="_6a _6b" style="height:50px"></div><div class="_6a _6b"><div class="fsl fwb fcb"><a href="abcde.com/?fref=pb&amp;hc_location=profile_browser"; data-gt="{&quot;engagement&quot;:{&quot;eng_type&quot;:&quot;1&quot;,&quot;eng_src&quot;:&quot;2&quot;,&quot;eng_tid&quot;:&quot;673597072&quot;,&quot;eng_data&quot;:[]}}" data-hovercard="/ajax/hovercard/user.php?id=673597072&amp;extragetparams=%7B%22hc_location%22%3A%22profile_browser%22%7D">Antonis Lambr</a></div></div></div></div></div></div></li>

So I only want to get the name "Antonis Lambr". The full text is more than a million characters, so I have pasted only a very small sample here.

Upvotes: 0

Views: 50

Answers (1)

Tim Schmelter

Reputation: 460360

You should not use string methods or regex to parse HTML. Instead, use a library like HtmlAgilityPack:

Dim html = "<div class=""_6a""><div class=""_6a _6b"" style=""height:50px""></div><div class=""_6a _6b""><div class=""fsl fwb fcb""><a href=""abcde.com/?fref=pb&amp;hc_location=profile_browser""; data-gt=""{&quot;engagement&quot;:{&quot;eng_type&quot;:&quot;1&quot;,&quot;eng_src&quot;:&quot;2&quot;,&quot;eng_tid&quot;:&quot;673597072&quot;,&quot;eng_data&quot;:[]}}"" data-hovercard=""/ajax/hovercard/user.php?id=673597072&amp;extragetparams=%7B%22hc_location%22%3A%22profile_browser%22%7D"">Antonis Lambr</a></div></div></div></div></div></div></li>"
Dim doc = New HtmlAgilityPack.HtmlDocument()
doc.LoadHtml(html)

Dim anchorsTexts = From a In doc.DocumentNode.SelectNodes("//a[@href]")
                   Select a.InnerText
Dim anchorTextList = anchorsTexts.ToList()

or with this syntax:

Dim anchorsTexts = From a In doc.DocumentNode.Descendants("a")
                   Where Not String.IsNullOrEmpty(a.GetAttributeValue("href", ""))
                   Select a.InnerText
Dim anchorTextList = anchorsTexts.ToList()

The list contains a single string, Antonis Lambr, which is the anchor text.
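
For the full million-character input, a slightly more defensive variant may be useful. The following is only a sketch under a few assumptions: the module name ContestantExtractor and the function name GetAnchorTexts are made up for illustration, and it assumes every contestant name is the text of an <a> element with an href attribute, as in the sample. It guards against SelectNodes returning Nothing when no node matches, decodes HTML entities such as &amp; with HtmlEntity.DeEntitize, and skips anchors whose text is empty.

```vbnet
Imports System.Collections.Generic
Imports System.Linq
Imports HtmlAgilityPack

Module ContestantExtractor
    ' Hypothetical helper (name chosen for illustration only).
    ' Returns the trimmed, entity-decoded text of every <a href="..."> element.
    Function GetAnchorTexts(html As String) As List(Of String)
        Dim doc = New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing, not an empty collection, when nothing matches.
        Dim anchors = doc.DocumentNode.SelectNodes("//a[@href]")
        If anchors Is Nothing Then
            Return New List(Of String)()
        End If

        ' Decode entities, trim whitespace, and drop anchors with no visible text.
        Dim anchorTexts = From a In anchors
                          Let anchorText = HtmlEntity.DeEntitize(a.InnerText).Trim()
                          Where anchorText.Length > 0
                          Select anchorText
        Return anchorTexts.ToList()
    End Function
End Module
```

In the button handler from the question, this could be called as Dim vList = GetAnchorTexts(tbxText.Text). If the real page also contains links that are not contestants, the XPath expression would need to be narrowed, for example by the class of the surrounding div.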

Upvotes: 2
