user1762489

VB.net: Extract and replace all instances of HTML

I am working on manipulating and extracting data from well-formed HTML in one of our legacy systems. I need to use regex to parse the HTML, find certain patterns, extract the data, and return some modified HTML. I know that regex and HTML are never the answer but, given that I know exactly where the data is coming from and that the data is properly structured, I am confident that this will work for this particular situation.

The HTML that I am working with has the following pattern:

<i>Name1</i>: Some text goes here<br/>
<i>Name2</i>: Some different text goes here<br/>
<i>Name3</i>: Some other different text goes here<br/>

I need to change the HTML to the following:

<i>Name1</i><p>Some text goes here</p>
<i>Name2</i><p>Some different text goes here</p>
<i>Name3</i><p>Some other different text goes here</p>

Basically, I want to take the text that follows each colon, wrap it in a p tag, and remove the colon and the trailing br.

I want to do something like the following:

Dim HTML as String = [The HTML goes here]
html = Regex.Replace(html, "</i>:(.+?)<br\s*\/?>", "</i><p>(.+?)</p>", RegexOptions.Multiline)

but it isn't working: the replacement string puts the literal text (.+?) into the output instead of the text that was matched.

In VB.net, how do I replace all desired instances of HTML with the new HTML?

Upvotes: 2

Views: 1172

Answers (2)

Oded

Reputation: 499002

I suggest using the HTML Agility Pack to parse and manipulate HTML (in particular if the format of the HTML is not regular). The source download comes with a bunch of example projects, so you can see how to use it.

In general Regex is not a good solution for parsing HTML.
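For the HTML in the question, a rough sketch of the Agility Pack approach might look like the following. It assumes the HtmlAgilityPack package is referenced, and that every `<i>` is followed directly by a text node and then a `<br>`, as in the sample; the module and function names are made up for illustration.

```vb
Imports HtmlAgilityPack

Module AgilityPackSketch
    ' Hypothetical helper: wraps the text after each <i>...</i> in a <p>
    ' and drops the <br> that follows it.
    Function WrapInParagraphs(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when there are no matches.
        Dim italics = doc.DocumentNode.SelectNodes("//i")
        If italics Is Nothing Then Return html

        For Each italic As HtmlNode In italics
            ' Expect: <i>Name</i> then a text node ": some text" then <br/>.
            Dim textNode = italic.NextSibling
            If textNode Is Nothing OrElse textNode.NodeType <> HtmlNodeType.Text Then Continue For

            Dim br = textNode.NextSibling
            If br Is Nothing OrElse br.Name <> "br" Then Continue For

            ' Strip the leading colon and surrounding whitespace.
            Dim text = textNode.InnerText.Trim().TrimStart(":"c).Trim()

            ' Build <p>text</p> and swap it in for the text node.
            Dim p = doc.CreateElement("p")
            p.AppendChild(doc.CreateTextNode(text))
            textNode.ParentNode.ReplaceChild(p, textNode)

            ' Remove the trailing <br>.
            br.Remove()
        Next

        Return doc.DocumentNode.OuterHtml
    End Function
End Module
```

Any `<i>` that does not fit the expected shape is skipped rather than guessed at, so unexpected markup is left untouched.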

Upvotes: 2

NakedBrunch

Reputation: 49413

Give this a shot:

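Imports System.Text.RegularExpressions

Dim html As String = [The HTML goes here]
' Build each replacement from capture group 1 (the text between the colon and the <br>).
' Trim() drops the space that follows the colon in the source HTML.
Dim evaluator As MatchEvaluator = Function(m As Match)
                                      Return "</i><p>" & m.Groups(1).Value.Trim() & "</p>"
                                  End Function
html = Regex.Replace(html, "</i>:(.+?)<br\s*\/?>", evaluator, RegexOptions.Multiline)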
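If you don't need any logic in the replacement, a substitution in the replacement string should be enough: .NET replaces `$1` with the contents of capture group 1. Here is a minimal, self-contained sketch of that variant. The module name and sample input are only for illustration, and the added `\s*` after the colon is my assumption that you don't want the leading space inside the `<p>`.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ReplaceSketch
    Sub Main()
        ' Sample input in the shape shown in the question.
        Dim html As String =
            "<i>Name1</i>: Some text goes here<br/>" & Environment.NewLine &
            "<i>Name2</i>: Some different text goes here<br/>"

        ' $1 in the replacement is substituted with capture group 1.
        ' \s* after the colon consumes the leading space so it is not captured.
        Dim result As String = Regex.Replace(html, "</i>:\s*(.+?)<br\s*/?>", "</i><p>$1</p>")

        Console.WriteLine(result)
        ' Prints:
        ' <i>Name1</i><p>Some text goes here</p>
        ' <i>Name2</i><p>Some different text goes here</p>
    End Sub
End Module
```

RegexOptions.Multiline is left out here because it only changes how ^ and $ match, and this pattern uses neither.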

Upvotes: 1
