Reputation: 16920
I need to extract all the text inside the <body> of an HTML document.
Sample HTML input:
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src="abc.jpg"/>
</body>
</html>
The output should be:
This is a big title. How are doing you? I am fine
I want to use only HtmlAgilityPack for this purpose. No regular expressions, please.
I know how to load an HtmlDocument and then use an XPath query like '//body' to get the body contents. But how do I strip the HTML tags, as shown in the output above?
Thanks in advance :)
Upvotes: 6
Views: 13643
Reputation: 3772
You can use NUglify, which supports text extraction from HTML:
var result = Uglify.HtmlToText("<div> <p>This is <em> a text </em></p> </div>");
Console.WriteLine(result.Code); // prints: This is a text
Since it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors) and very fast: no regular expressions are involved, only a pure recursive descent parser, which is faster than HtmlAgilityPack and more GC friendly.
Upvotes: 2
Reputation: 28134
How about using the XPath expression '//body//text()' to select all text nodes?
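A minimal sketch of that idea, assuming the HtmlAgilityPack NuGet package is referenced. The helper name `BodyText.Extract` is made up for illustration. It joins the selected text nodes with a space and collapses whitespace with `string.Split`, so no regular expression is involved:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class BodyText
{
    // Hypothetical helper: returns the text found under <body> as a single line.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (nodes == null) return string.Empty;

        // Decode entities such as &amp; and join the pieces with a space.
        var joined = string.Join(" ", nodes.Select(n => HtmlEntity.DeEntitize(n.InnerText)));

        // Splitting on any whitespace and re-joining collapses runs of spaces and newlines.
        return string.Join(" ", joined.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
    }
}
```

Be aware that text inside <script> or <style> elements within the body would also be selected by this expression, so those may need to be filtered out separately.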
Upvotes: 3
Reputation: 138137
You can use the body's InnerText:
string html = @"
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src=""abc.jpg""/>
</body>
</html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Next, you may want to collapse spaces and new lines:
text = Regex.Replace(text, @"\s+", " ").Trim();
Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, because the tags are simply removed. That issue is difficult to solve, as display is often determined by the CSS, not just by the markup.
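One possible workaround for the hello<br>world case, sketched under assumptions: walk the node tree and insert a space around elements that usually break the text flow, while leaving inline elements such as <i> alone. The class name `SpacedText` and the tag list are illustrative guesses, not part of HtmlAgilityPack, and the list ignores whatever CSS does to the layout:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

public static class SpacedText
{
    // Assumed, incomplete list of tags that visually separate text.
    static readonly HashSet<string> Breaking =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "br", "p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr", "td" };

    public static string Extract(HtmlNode root)
    {
        var sb = new StringBuilder();
        Walk(root, sb);
        // Collapse whitespace without a regular expression.
        return string.Join(" ", sb.ToString().Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
    }

    static void Walk(HtmlNode node, StringBuilder sb)
    {
        foreach (var child in node.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Text)
            {
                sb.Append(HtmlEntity.DeEntitize(child.InnerText));
            }
            else if (child.NodeType == HtmlNodeType.Element)
            {
                // Pad only the elements assumed to break the line; comments are skipped.
                bool breaks = Breaking.Contains(child.Name);
                if (breaks) sb.Append(' ');
                Walk(child, sb);
                if (breaks) sb.Append(' ');
            }
        }
    }
}
```

Calling `SpacedText.Extract(doc.DocumentNode.SelectSingleNode("//body"))` on the sample document should give the same single line as before, while hello<br>world becomes "hello world" and hello<i>world</i> stays "helloworld".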
Upvotes: 5
Reputation: 2283
Normally I would recommend an HTML parser for parsing HTML; however, since you only want to remove all the HTML tags, a simple regex should work.
Upvotes: 1