Reputation: 423
I have a very large block of text in HTML format, and I want to strip out everything that sits inside the tags (<...>). Somewhere outside the tags there is a name that I want to extract, as in <a><a><a>name</a></a></a>.
I wrote a small program in which I paste the long text and press a button. The code takes the substring from the start up to the next >, and then, if the Length of the content between that > and the next < is greater than 1, it adds that content to a list.
Basically I want to get all the names that are not inside the <> tags and put them in a list. These are the contestants in a competition.
The thing is, the code doesn't work exactly as I hoped. It needs a small tweak and I'm a little confused.
Here is my code:
Private Sub btnPickWinner_Click(sender As Object, e As EventArgs) Handles btnPickWinner.Click
    Dim vContestant As String = ""
    Dim vAllText As String = tbxText.Text
    Dim vList As New List(Of String)
    Dim vLength As Integer
    Dim vStrToCountLength As String
    While vAllText.Length > 1
        vStrToCountLength = vAllText.Substring(0, vAllText.IndexOf(">"))
        vLength = vAllText.Length - vStrToCountLength.Length
        vAllText = vAllText.Substring(vAllText.IndexOf(">"), vLength)
        vContestant = vAllText.Substring(0, vAllText.IndexOf("<"))
        If (vContestant.Length > 1) Then
            vList.Add(vContestant)
        End If
    End While
End Sub
Here is a small sample of the text I have:
<div class="_6a"><div class="_6a _6b" style="height:50px"></div><div class="_6a _6b"><div class="fsl fwb fcb"><a href="abcde.com/?fref=pb&hc_location=profile_browser"; data-gt="{"engagement":{"eng_type":"1","eng_src":"2","eng_tid":"673597072","eng_data":[]}}" data-hovercard="/ajax/hovercard/user.php?id=673597072&extragetparams=%7B%22hc_location%22%3A%22profile_browser%22%7D">Antonis Lambr</a></div></div></div></div></div></div></li>
So I only want to get the name "Antonis Lambr". The text I have is more than a million characters so I just pasted a very small sample here...
Upvotes: 0
Views: 50
Reputation: 460360
You should not use string methods or regex to parse HTML. Instead, use a library like HtmlAgilityPack:
' The quotes inside the data-gt attribute are written as &quot; entities so that the VB string literal stays valid.
Dim html = "<div class=""_6a""><div class=""_6a _6b"" style=""height:50px""></div><div class=""_6a _6b""><div class=""fsl fwb fcb""><a href=""abcde.com/?fref=pb&hc_location=profile_browser""; data-gt=""{&quot;engagement&quot;:{&quot;eng_type&quot;:&quot;1&quot;,&quot;eng_src&quot;:&quot;2&quot;,&quot;eng_tid&quot;:&quot;673597072&quot;,&quot;eng_data&quot;:[]}}"" data-hovercard=""/ajax/hovercard/user.php?id=673597072&extragetparams=%7B%22hc_location%22%3A%22profile_browser%22%7D"">Antonis Lambr</a></div></div></div></div></div></div></li>"
Dim doc = New HtmlAgilityPack.HtmlDocument()
doc.LoadHtml(html)
' Select every <a> element that has an href attribute and take its text.
Dim anchorsTexts = From a In doc.DocumentNode.SelectNodes("//a[@href]")
                   Select a.InnerText
Dim anchorTextList = anchorsTexts.ToList()
or with this syntax:
Dim anchorsTexts = From a In doc.DocumentNode.Descendants("a")
                   Where Not String.IsNullOrEmpty(a.GetAttributeValue("href", ""))
                   Select a.InnerText
Dim anchorTextList = anchorsTexts.ToList()
The list contains a single string, Antonis Lambr, which is the anchor text.
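To tie this back to the button handler in the question, here is a minimal sketch of how the same idea could be wrapped in a reusable function. It assumes the HtmlAgilityPack NuGet package is referenced; the module name ContestantExtractor, the function name ExtractNames, and the sample markup in Main are hypothetical and not part of the original code. It uses Descendants rather than SelectNodes because SelectNodes returns Nothing when no node matches, which would make the query throw on input without links.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module ContestantExtractor

    ' Hypothetical helper: returns the trimmed, non-empty text of every <a> element.
    Function ExtractNames(html As String) As List(Of String)
        ' Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        ' Descendants yields an empty sequence when there are no anchors,
        ' so no Nothing check is needed here.
        Return doc.DocumentNode.Descendants("a").
            Select(Function(a) HtmlAgilityPack.HtmlEntity.DeEntitize(a.InnerText).Trim()).
            Where(Function(text) text.Length > 0).
            ToList()
    End Function

    Sub Main()
        ' Made-up sample: one anchor with a name, one with only whitespace.
        Dim sample = "<div><a href=""#"">Antonis Lambr</a></div><div><a href=""#""> </a></div>"
        For Each contestant In ExtractNames(sample)
            Console.WriteLine(contestant)
        Next
    End Sub

End Module
```

Inside btnPickWinner_Click, the whole While loop could then be replaced by a single call such as Dim vList = ExtractNames(tbxText.Text). If the page contains anchors that are not contestant names, the filter would need to be narrowed, for example by checking an attribute that only the name links carry.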
Upvotes: 2