Reputation: 3345
I am trying to get all the text between the following tags, and it is just not working:
If Not String.IsNullOrEmpty(_html) Then
    Dim regex As Regex = New Regex( _
        ".*<entry(?<link>.+)</entry>", _
        RegexOptions.IgnoreCase _
        Or RegexOptions.CultureInvariant _
        Or RegexOptions.Multiline _
        )
    Dim ms As MatchCollection = regex.Matches(_html)
    Dim url As String = String.Empty
    For Each m As Match In ms
        url = m.Groups("link").Value
        urls.Add(url)
    Next
    Return urls
I have already written my fetch functions to get the HTML as a string. I was looking at an example for the HTML Agility Pack, but I don't have files saved as HTML documents:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has an href attribute.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att);
}
doc.Save("file.htm");
Upvotes: 2
Views: 161
Reputation: 416159
The best way to do this in .NET is with the HTML Agility Pack. Using regular expressions on HTML is usually not a good idea.
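Since the question mentions that the HTML is already held in a string rather than in a file, here is a minimal sketch of how the Agility Pack could be used in that situation. It assumes the HtmlAgilityPack package is referenced and that the elements of interest are named `entry` (taken from the regex in the question). The module and function names are hypothetical.

```vbnet
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module EntryExtractor
    ' Returns the inner HTML of every <entry> element in the given markup.
    ' "entry" is an assumption based on the pattern in the question.
    Public Function GetEntries(ByVal html As String) As List(Of String)
        Dim results As New List(Of String)()
        If String.IsNullOrEmpty(html) Then Return results

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html) ' parses from a string, so no file on disk is needed

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//entry")
        If nodes Is Nothing Then Return results

        For Each node As HtmlNode In nodes
            results.Add(node.InnerHtml)
        Next
        Return results
    End Function
End Module
```

`LoadHtml` is the string counterpart of `Load`, so the existing fetch functions can pass their result straight in.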
The exceptions are situations where you can make certain assumptions about the structure of the HTML, such as one-off jobs (where you can study the actual input for your program) or cases where the HTML is generated by a trusted source. For example, can you assume that the HTML is well-formed, or that tags will not be nested beyond a certain depth? (Note that neither of those assumptions is, by itself, enough to build an expression that won't fall down on some edge case or other.)
If you meet these criteria, we need to know exactly which assumptions you are allowed to make before we can write an accurate expression.
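As an illustration only, here is a sketch of what such an expression might look like under two stated assumptions: `entry` elements are never nested, and every one has a closing tag. Two things in the question's pattern work against it. The greedy `.*` and `.+` swallow everything up to the last `</entry>` they can reach, and `RegexOptions.Multiline` only changes the meaning of `^` and `$`; it is `RegexOptions.Singleline` that lets `.` match newlines. The module and function names below are hypothetical.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module EntryRegex
    ' Assumes <entry> elements are never nested and are always closed.
    ' [^>]* skips attributes, so it also assumes no ">" inside attribute values.
    Private ReadOnly EntryPattern As New Regex( _
        "<entry\b[^>]*>(?<link>.*?)</entry>", _
        RegexOptions.IgnoreCase Or RegexOptions.CultureInvariant Or RegexOptions.Singleline)

    ' Returns the text between each <entry ...> and its closing </entry>.
    Public Function GetEntryContents(ByVal html As String) As List(Of String)
        Dim urls As New List(Of String)()
        If String.IsNullOrEmpty(html) Then Return urls

        ' The lazy .*? stops at the first </entry>, giving one match per element.
        For Each m As Match In EntryPattern.Matches(html)
            urls.Add(m.Groups("link").Value)
        Next
        Return urls
    End Function
End Module
```

With the input `<entry>a</entry><entry id="2">b</entry>`, this returns two items, `a` and `b`. If either assumption does not hold for the real input, the pattern will silently return wrong results, which is the reason for preferring a parser.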
Upvotes: 2
Reputation: 1868
I would use this software to help with your regexes.
Free RegExBuilder software.
Upvotes: 4
Reputation: 85126
Obligatory "don't use regex to parse HTML" warning:
Using regex to parse HTML has been covered at length on SO. Please read the following post:
RegEx match open tags except XHTML self-contained tags
Would it be possible to convert your HTML to XHTML and parse it using XPath?
Using a tool like HTML Tidy or SGML, you can do this conversion. Then you could use XPath to extract the desired data: //entry/link
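Once the markup is well-formed, the XPath step could look roughly like the sketch below. It assumes the conversion has already been done, that the result has no DOCTYPE and no default XML namespace (otherwise an `XmlNamespaceManager` and prefixed element names would be needed), and that the document really does contain `link` elements inside `entry` elements. The module and function names are hypothetical.

```vbnet
Imports System.Collections.Generic
Imports System.Xml

Module EntryXPath
    ' Returns the text of every <link> that is a direct child of an <entry>.
    ' Assumes the input is already well-formed XML with no default namespace.
    Public Function GetLinks(ByVal xhtml As String) As List(Of String)
        Dim links As New List(Of String)()
        Dim doc As New XmlDocument()
        doc.LoadXml(xhtml) ' throws XmlException if the markup is not well-formed

        ' SelectNodes returns an empty list when nothing matches.
        For Each node As XmlNode In doc.SelectNodes("//entry/link")
            links.Add(node.InnerText)
        Next
        Return links
    End Function
End Module
```

Because `LoadXml` rejects malformed input outright, a failed conversion shows up as an exception rather than as silently missing results.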
Upvotes: 1