Reputation: 6478

Extracting an html fragment from an html document

I'm looking for an efficient means of extracting an html "fragment" from an html document. My first implementation of this used the Html Agility Pack. This appeared to be a reasonable way to attack this problem, until I started running the extraction on large html documents - performance was very poor for something so trivial (I'm guessing due to the amount of time it was taking to parse the entire document).

Can anyone suggest a more efficient means of achieving my goal?

To summarize:

For my purposes, an html "fragment" is defined as all content inside the <body> tags of an html document.
Ideally, I'd like the content returned unaltered if it doesn't contain an <html> or <body> tag (I'll assume I was passed an html fragment to begin with).
I have the entire html document available in memory (as a string) and won't be streaming it on demand, so a potential solution doesn't need to worry about that.
Performance is critical, so a potential solution should account for this.

Sample Input:

<html>
   <head>
     <title>blah</title>
   </head>
   <body>
    <p>My content</p>
   </body>
</html>

Desired Output:

<p>My content</p>

A solution in C# or VB.NET would be welcome.

Upvotes: 2

Answers (3)

P.Brian.Mackey

Reputation: 44275

Most HTML is not going to be XHTML compliant. I would do an HTTP GET request (or start from the string you already have in memory) and search the resulting text for "<body>" and "</body>" using .IndexOf(...); note that .Contains(...) only returns a bool, so it cannot give you positions. You can use these two positions as your start and stop indexes for a reader stream or a Substring call. Outside the body tag you really don't need to worry about XML compliance.
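
A minimal sketch of this start/stop index idea, assuming the document is already held in a string. The class and method names (HtmlFragmentExtractor, ExtractBody, FindBodyOpen) are made up for illustration. It matches the tags case-insensitively, tolerates attributes on <body>, and returns the input unchanged when no <body> tag is found. It is plain string searching, not HTML parsing, so a "<body" inside a comment or script, or a ">" inside a body attribute value, would confuse it.

```csharp
using System;

public static class HtmlFragmentExtractor
{
    // Returns the markup between <body ...> and the last </body>,
    // or the input unchanged when no <body> tag is found
    // (the input is then assumed to be a fragment already).
    public static string ExtractBody(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;

        int open = FindBodyOpen(html);
        if (open < 0) return html;

        // The content starts right after the '>' that closes the opening tag.
        // Assumption: no '>' appears inside an attribute value of <body>.
        int contentStart = html.IndexOf('>', open);
        if (contentStart < 0) return html;
        contentStart++;

        // Use the last closing tag; if it is missing, take everything to the end.
        int close = html.LastIndexOf("</body", StringComparison.OrdinalIgnoreCase);
        int contentEnd = close >= contentStart ? close : html.Length;

        // Trim only the extracted body, matching the desired output in the question.
        return html.Substring(contentStart, contentEnd - contentStart).Trim();
    }

    // Finds "<body" followed by '>' or whitespace, so that a tag such as
    // <bodyguard> is not mistaken for <body>.
    private static int FindBodyOpen(string html)
    {
        int i = 0;
        while ((i = html.IndexOf("<body", i, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            int next = i + 5;
            if (next < html.Length && (html[next] == '>' || char.IsWhiteSpace(html[next])))
                return i;
            i = next;
        }
        return -1;
    }
}
```

Calling HtmlFragmentExtractor.ExtractBody on the sample input from the question returns <p>My content</p>, and a string with no <body> tag comes back untouched. Because it makes only a few ordinal scans over the string and one Substring call, it avoids building a DOM for the whole document.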

Upvotes: 2

Mark Avenius

Reputation: 13947

If I remember correctly, I did something similar in the past with an XPathNavigator. I think it looked something like this:

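        // Requires: using System.IO; using System.Xml.XPath;
        // Note: XPathDocument only accepts well-formed XML (XHTML), not arbitrary HTML.
        XPathDocument xDoc = new XPathDocument(new StringReader(content));
        XPathNavigator xNav = xDoc.CreateNavigator();
        XPathNavigator node = xNav.SelectSingleNode("//body"); // "/body" would only match a root-level <body>
        string fragment = node != null ? node.InnerXml : content;

where you could change //body to whatever you need to look for.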

Upvotes: 0

Brad Christie

Reputation: 101594

You could hack it using a WebBrowser control and take advantage of the webBrowser1.Document property (though I'm not sure what you're trying to accomplish).

Upvotes: 0
