Reputation: 301
I need to decode HTML into plain text. I know that there are a lot of questions like this but I noticed one problem with those solutions and don't know how to solve it.
For example we have this piece of HTML:
<h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>
Tried regex solutions, HttpUtility.HtmlDecode method. And all of them give this output: Some textSome more text
. Words get connected where they should be separate. Is there a way to decode string without merging words?
Upvotes: 2
Views: 630
Reputation: 2734
It's not clear what separator you wan between things that were not separated in the first place. So I used NewLine \n
.
Where(x=>!string.IsNullOrWhiteSpace(x)
will remove the empty element that will result in a lot of \n\n
in more complex html doc
var input = "<h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(input);
var result = string.Join(
"\n",
htmlDocument
.DocumentNode
.ChildNodes
.Select(x=> x.InnerText)
.Where(x=>!string.IsNullOrWhiteSpace(x))
);
Result:
"Some text\nSome more text"
Upvotes: 4
Reputation: 152
You can use something as follows. In this sample i have used new line to separate inner text, hope you can adapt this to suite your scenario.
public static string GetPlainTextFromHTML(string inputText)
{
// Extracted plain text
var plainText = string.Empty;
if(string.IsNullOrWhiteSpace(inputText))
{
return plainText;
}
var htmlNote = new HtmlDocument();
htmlNote.LoadHtml(inputText);
var nodes = htmlNote.DocumentNode.ChildNodes;
if(nodes == null)
{
return plainText;
}
StringBuilder innerString = new StringBuilder();
// Replace <p> with new lines
foreach (HtmlNode node in nodes)
{
innerString.Append(node.InnerText);
innerString.Append("\\n");
}
plainText = innerString.ToString();
return plainText;
}
Upvotes: 0
Reputation: 3880
easy way to do it is to use HTML Agility pack:
HtmlDocument htmlDocument= new HtmlDocument();
htmlDocument.Load(htmlString);
string res=htmlDocument.DocumentNode.SelectSingleNode("YOUR XPATH TO THE INTRESTING ELEMENT").InnerText
Upvotes: 2