PovilasZ

Reputation: 301

How to decode HTML into string?

I need to decode HTML into plain text. I know there are a lot of questions like this, but I noticed one problem with the suggested solutions and don't know how to solve it.

For example we have this piece of HTML: <h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>

I tried regex solutions and the HttpUtility.HtmlDecode method, and all of them give this output: Some textSome more text. Words that should be separate get joined together. Is there a way to decode the string without merging words?

Upvotes: 2

Views: 630

Answers (4)

Drag and Drop

Reputation: 2734

It's not clear what separator you want between pieces of text that were not separated in the first place, so I used a newline (\n).
The Where(x => !string.IsNullOrWhiteSpace(x)) call removes the empty elements, which would otherwise produce a lot of \n\n runs in a more complex HTML document.

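// Requires the HtmlAgilityPack NuGet package
using System.Linq;
using HtmlAgilityPack;

var input = "<h1><strong>Some text</strong></h1><p><br></p><p>Some more text</p>";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(input);

// Join the text of each non-empty top-level node with a newline
var result = string.Join(
                "\n",
                htmlDocument
                    .DocumentNode
                    .ChildNodes
                    .Select(x => x.InnerText)
                    .Where(x => !string.IsNullOrWhiteSpace(x))
              );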

Result:

"Some text\nSome more text"

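Note that ChildNodes only covers the top level, so text nested inside a wrapper such as a <div> would still come back merged. A minimal sketch of one way around that, assuming HtmlAgilityPack and a hypothetical helper name (HtmlText.ToLines, not part of the library), is to collect every text node at any depth:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class HtmlText
{
    // Hypothetical helper: returns one line per non-empty text node, at any depth.
    public static string ToLines(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = doc.DocumentNode
            .Descendants()                                    // every node below the root
            .Where(n => n.NodeType == HtmlNodeType.Text)      // keep text nodes only
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()) // decode entities such as &amp;
            .Where(t => t.Length > 0);                        // skip whitespace-only nodes

        return string.Join("\n", lines);
    }
}
```

The trade-off is that inline elements are split as well: the text inside a <strong> and the text that follows it end up on separate lines, so this may need adjusting if inline markup is common in your input.
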
Upvotes: 4

Hasitha

Reputation: 152

You can use something like the following. In this sample I have used a new line to separate the inner text; hopefully you can adapt it to suit your scenario.

public static string GetPlainTextFromHTML(string inputText)
{
    // Extracted plain text
    var plainText = string.Empty;

    if (string.IsNullOrWhiteSpace(inputText))
    {
        return plainText;
    }

    var htmlNote = new HtmlDocument();
    htmlNote.LoadHtml(inputText);

    var nodes = htmlNote.DocumentNode.ChildNodes;
    if (nodes == null)
    {
        return plainText;
    }

    StringBuilder innerString = new StringBuilder();

    // Append each top-level node's text, followed by a new line
    foreach (HtmlNode node in nodes)
    {
        innerString.Append(node.InnerText);
        innerString.Append("\n");
    }

    plainText = innerString.ToString();
    return plainText;
}

Upvotes: 0

Or Yaacov

Reputation: 3880

An easy way to do it is to use the HTML Agility Pack:

HtmlDocument htmlDocument = new HtmlDocument();
// LoadHtml parses an HTML string; Load expects a file path or stream
htmlDocument.LoadHtml(htmlString);
string res = htmlDocument.DocumentNode.SelectSingleNode("YOUR XPATH TO THE INTERESTING ELEMENT").InnerText;
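
SelectSingleNode returns null when the XPath matches nothing, so reading InnerText directly can throw. A small sketch with a null check, assuming HtmlAgilityPack and a hypothetical helper name (XPathText.FirstText is only for illustration):

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class XPathText
{
    // Hypothetical helper: text of the first node matching the XPath, or null if nothing matches.
    public static string FirstText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText; // null-conditional avoids a NullReferenceException
    }
}
```

For the HTML in the question, XPathText.FirstText(input, "//h1") would give "Some text". This reads one element at a time, so you would still need to join the pieces yourself to get the whole document as separated text.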

Upvotes: 2

Magnus Dot

Reputation: 1

You can use a regex to match the tags that should act as separators: <(div|/div|br|p|/p)[^>]{0,}>
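
One possible way to apply that pattern, as a rough sketch rather than a robust HTML parser: replace the matched tags with a newline first, then strip whatever tags remain. The class and method names (RegexHtmlText.Strip) are made up for illustration, and the second and third patterns are my own assumptions, not part of the pattern above.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class RegexHtmlText
{
    // Hypothetical helper: regex-based tag stripping that keeps block boundaries as newlines.
    public static string Strip(string html)
    {
        // 1. Turn the tags matched by the pattern into line breaks.
        var text = Regex.Replace(html, @"<(div|/div|br|p|/p)[^>]{0,}>", "\n");

        // 2. Remove every remaining tag (assumed pattern: anything between < and >).
        text = Regex.Replace(text, @"<[^>]+>", string.Empty);

        // 3. Collapse runs of newlines left by empty blocks and trim the ends.
        text = Regex.Replace(text, @"\n{2,}", "\n").Trim();

        // 4. Decode entities such as &amp; into plain characters.
        return WebUtility.HtmlDecode(text);
    }
}
```

For the HTML in the question this yields "Some text\nSome more text". Be aware that the pattern also matches other tags whose names start with p, such as <pre>, and that regexes in general can trip over comments, scripts, or a > inside an attribute value.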

Upvotes: -1
