Reputation: 6478
I'm looking for an efficient means of extracting an html "fragment" from an html document. My first implementation of this used the Html Agility Pack. This appeared to be a reasonable way to attack this problem, until I started running the extraction on large html documents - performance was very poor for something so trivial (I'm guessing due to the amount of time it was taking to parse the entire document).
Can anyone suggest a more efficient means of achieving my goal?
To summarize:
For my purposes, an html "fragment" is defined as all content inside of the <body> tags of an html document
Ideally, I'd like to return the content unaltered if it didn't contain an <html> or <body> (I'll assume I was passed an html fragment to begin with)
I have the entire html document available in memory (as a string); I won't be streaming it on demand, so a potential solution won't need to worry about that.
Performance is critical, so a potential solution should account for this.
Sample Input:
<html>
<head>
<title>blah</title>
</head>
<body>
<p>My content</p>
</body>
</html>
Desired Output:
<p>My content</p>
A solution in C# or VB.NET would be welcome.
Upvotes: 2
Views: 921
Reputation: 44275
Most html is not going to be XHTML compliant. I would do an HTTP GET request and search the resultant text with .IndexOf("<body>") and .IndexOf("</body>") (.Contains() only tells you whether the tag is present, not where it is). You can use these two positions as your start and stop indexes for a reader stream. Outside the body tag you really don't need to worry about XML compliance.
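A minimal sketch of the index-based idea above, assuming the whole document is already in memory as a string (as the question states). The class and method names (HtmlFragment, ExtractBody) are hypothetical, not from the original posts. It searches for "<body" rather than "<body>" so that an opening tag with attributes still matches, and it trims surrounding whitespace, which is an assumption based on the desired output shown in the question. It does not guard against "<body" appearing inside a comment or script.

```csharp
using System;

public static class HtmlFragment
{
    // Hypothetical helper: returns the text between the opening <body ...> tag
    // and the last </body> tag, or the input unchanged when either is missing.
    public static string ExtractBody(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;

        // Case-insensitive ordinal search, since real-world markup may use <BODY>.
        int open = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (open < 0) return html;

        // Skip past any attributes to the end of the opening tag.
        int openEnd = html.IndexOf('>', open);
        if (openEnd < 0) return html;

        // Use the last closing tag so stray "</body" text earlier in the page is ignored.
        int close = html.LastIndexOf("</body", StringComparison.OrdinalIgnoreCase);
        if (close <= openEnd) return html;

        // Assumption: surrounding whitespace is not wanted (matches the desired output).
        return html.Substring(openEnd + 1, close - openEnd - 1).Trim();
    }
}
```

Usage would be something like string fragment = HtmlFragment.ExtractBody(html); since no parse tree is built, the cost is roughly two linear scans of the string.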
Upvotes: 2
Reputation: 13947
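If I remember correctly, I did something similar in the past with an XPathNavigator. I think it looked something like this:
XPathDocument xDoc = new System.Xml.XPath.XPathDocument(new StringReader(content));
XPathNavigator xNav = xDoc.CreateNavigator();
// <body> is a child of the root <html> element, so the path is /html/body
XPathNavigator node = xNav.SelectSingleNode("/html/body"); // null if nothing matches
where you could change /html/body to whatever you need to look for. Note that XPathDocument expects well-formed XML, so this only works when the document parses as XML.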
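A fuller sketch of the XPath approach, with the fallback the question asks for. XPathDocument only accepts well-formed XML, so this version catches XmlException and returns the input unaltered when parsing fails or when no body element is found. The class and method names are hypothetical. It uses the path /html/body because body is a child of the root html element. A document that declares the XHTML namespace would not match this un-prefixed path and would also be returned unaltered.

```csharp
using System.IO;
using System.Xml;
using System.Xml.XPath;

public static class XPathFragment
{
    // Hypothetical helper: returns the inner markup of /html/body, or the
    // original string when the input is not well-formed XML or has no body.
    public static string ExtractBody(string content)
    {
        if (string.IsNullOrEmpty(content)) return content;

        try
        {
            XPathDocument xDoc = new XPathDocument(new StringReader(content));
            XPathNavigator node = xDoc.CreateNavigator().SelectSingleNode("/html/body");

            // SelectSingleNode returns null when nothing matches the path.
            return node != null ? node.InnerXml : content;
        }
        catch (XmlException)
        {
            // Typical real-world HTML is not well-formed XML; treat it as a fragment.
            return content;
        }
    }
}
```

Because the whole document is still parsed, this is unlikely to be faster than a plain string search on large inputs.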
Upvotes: 0
Reputation: 101594
You could hack it using a WebBrowser control and take advantage of the webBrowser1.Document property (though not sure what you're trying to accomplish).
Upvotes: 0