Cain

Reputation:

C# code to convert XHTML doc to plain text

I'm writing a utility to export Evernote notes into Outlook on a schedule. The Outlook APIs need plain text, but Evernote outputs an XHTML version of the plain-text note. What I need is to strip out all the tags and unescape the source XHTML doc embedded in the Evernote export file.

Basically, I need to turn:

<note>
  <title>Test Sync Note 1</title> 
  <content>
  <![CDATA[ <?xml version="1.0" encoding="UTF-8"?>
   <!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml.dtd">

<en-note bgcolor="#FFFFFF">
<div>Test Sync Note 1</div>
<div>This i has some text in it</div>
<div>&nbsp;</div>
<div>&nbsp;</div>
<div>and a second line</div>
</en-note>

  ]]> 
  </content>
  <created>20081028T045727Z</created> 
  <updated>20081028T051346Z</updated> 
  <tag>Test</tag> 
</note>

Into


    Test Sync Note 1
    This i has some text in it


    and a second line

I can easily parse out the CDATA section and get just the lines of text, but I need a reliable way to strip the divs, unescape entities, and deal with any extra HTML that might have snuck in there.

I'm assuming that there's some MS API combo that will do the job, but I don't know it.

Upvotes: 0

Views: 4987

Answers (5)

Sunny Milenov

Reputation: 22310

You can use the HTML Agility Pack.

Upvotes: 1

Rune Grimstad

Reputation: 36330

You can also use an XSLT transformation to convert the XML into a text document.
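A minimal sketch of how that could look with the built-in XslCompiledTransform. The class name EnmlXslt, the method ToPlainText, and the stylesheet are made up for illustration. It assumes the CDATA content has already been extracted and trimmed, has no DOCTYPE, and uses &#160; rather than &nbsp; (the parser cannot resolve the named entity without the DTD):

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class EnmlXslt
{
    // Hypothetical stylesheet: emit each <div>'s text followed by a line feed
    // and suppress all other text nodes (for example, whitespace between divs).
    private const string Stylesheet = @"<?xml version=""1.0""?>
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""text""/>
  <xsl:template match=""div"">
    <xsl:value-of select="".""/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
  <xsl:template match=""text()""/>
</xsl:stylesheet>";

    public static string ToPlainText(string enml)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader sheet = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(sheet);
        }

        using (XmlReader input = XmlReader.Create(new StringReader(enml)))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            // Non-breaking spaces (from &#160;) become ordinary spaces.
            return output.ToString().Replace('\u00A0', ' ');
        }
    }
}
```

Calling EnmlXslt.ToPlainText on the en-note element should yield one line of text per div. Nested markup inside a div is flattened to its text content.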

Upvotes: 1

Xian

Reputation: 76601

I would use a regular expression to strip out all the HTML tags. This one is pretty basic, but I'm sure you can tweak it if it doesn't work exactly as you want.

Regex.Replace("<div>your html in here</div>",@"<(.|\n)*?>",string.Empty);
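A slightly extended sketch of the same idea, in case it helps: it turns each closing div into a line break before stripping the remaining tags, then decodes entities with WebUtility.HtmlDecode. The class and method names (TagStripper, StripTags) are hypothetical, and this is still a regex over markup, so it is only an approximation for well-behaved input like the sample note:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagStripper
{
    public static string StripTags(string html)
    {
        // Turn each closing </div> into a line break so the line structure survives.
        string withBreaks = Regex.Replace(html, @"</div\s*>", "\n", RegexOptions.IgnoreCase);

        // Remove every remaining tag, including <?xml ...?> and <!DOCTYPE ...>.
        string noTags = Regex.Replace(withBreaks, @"<[^>]+>", string.Empty);

        // Decode entities such as &amp; and &nbsp;, then map the non-breaking
        // space that &nbsp; decodes to onto an ordinary space.
        return WebUtility.HtmlDecode(noTags).Replace('\u00A0', ' ');
    }
}
```

For example, StripTags("<div>a &amp; b</div>") gives "a & b" followed by a line feed.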

Upvotes: 1

Vinko Vrsalovic

Reputation: 340316

        // Requires: using System; using System.IO; using System.Xml.XPath;
        string xml = @"<note>
          <title>Test Sync Note 1</title> 
          <content>
          <![CDATA[ <?xml version=""1.0"" encoding=""UTF-8""?>
           <!DOCTYPE en-note SYSTEM ""http://xml.evernote.com/pub/enml.dtd"">

        <en-note bgcolor=""#FFFFFF"">
        <div>Test Sync Note 1</div>
        <div>This i has some text in it</div>
        <div> </div>
        <div> </div>
        <div>and a second line</div>
        </en-note>

          ]]> 
          </content>
          <created>20081028T045727Z</created> 
          <updated>20081028T051346Z</updated> 
          <tag>Test</tag> 
        </note>
        ";
        XPathDocument doc = new XPathDocument(new StringReader(xml));
        XPathNavigator nav = doc.CreateNavigator();

        // Compile a standard XPath expression

        XPathExpression expr;
        expr = nav.Compile("/note/content");
        XPathNodeIterator iterator = nav.Select(expr);

        // Iterate on the node set

        try
        {
            while (iterator.MoveNext())
            {
                //Get the XML in the CDATA
                XPathNavigator nav2 = iterator.Current.Clone();
                XPathDocument doc2 = new XPathDocument(new StringReader(nav2.Value.Trim()));

                //Parse the XML in the CDATA
                XPathNavigator nav3 = doc2.CreateNavigator();
                expr = nav3.Compile("/en-note");
                XPathNodeIterator iterator2 = nav3.Select(expr);
                iterator2.MoveNext();
                XPathNavigator nav4 = iterator2.Current.Clone();

                //Output the value directly, does not preserve the formatting
                Console.WriteLine("Direct Try:");
                Console.WriteLine(nav4.Value);

                //This works, but is ugly
                Console.WriteLine("Ugly Try:");
                Console.WriteLine(nav4.InnerXml.Replace("<div>","").Replace("</div>",Environment.NewLine));
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }

Upvotes: 0

Andrew Kennan

Reputation: 14157

As far as I know, there isn't anything that does that specific job, but you might want to look at using XSLT or walking through an IXPathNavigable.
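A rough sketch of the walking approach, under some assumptions: the names EnmlWalker and ToPlainText are made up, the input is the text already pulled out of the CDATA section, &nbsp; is the only named entity that needs handling, and every line of the note is a div directly under en-note:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.XPath;

public static class EnmlWalker
{
    public static string ToPlainText(string enml)
    {
        // Assumption: &nbsp; is the only named entity in the note. It is rewritten
        // as a numeric reference because the DTD that declares it is not loaded.
        string prepared = enml.Trim().Replace("&nbsp;", "&#160;");

        // Skip the DOCTYPE instead of fetching the external DTD.
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore,
            XmlResolver = null
        };

        var lines = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(prepared), settings))
        {
            XPathNavigator nav = new XPathDocument(reader).CreateNavigator();
            XPathNodeIterator divs = nav.Select("/en-note/div");
            while (divs.MoveNext())
            {
                // Value is the concatenated text of the div, with nested tags dropped.
                lines.Add(divs.Current.Value.Replace('\u00A0', ' ').TrimEnd());
            }
        }
        return string.Join(Environment.NewLine, lines);
    }
}
```

Each div becomes one output line, and a div that holds only &nbsp; becomes an empty line, which matches the blank lines in the desired output.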

Upvotes: 0
