TinMan7757

Reputation: 148

Convert Docx to html using OpenXml power tools without formatting

I'm using OpenXml Power Tools in my project to convert a document (docx) into HTML. Using the example code provided with the SDK, it produces an elegant duplicate in HTML form. (GitHub link: https://github.com/OfficeDev/Open-Xml-PowerTools/blob/vNext/OpenXmlPowerToolsExamples/HtmlConverter01/HtmlConverter01.cs )

However, looking at the generated HTML markup, every element carries embedded (inline) styling.

Is there any way of turning this off and using plain and simple <h1> and <p> tags?

I would like to remove this embedded styling, as the formatting would be taken care of by Bootstrap.

The embedded styling is as follows :

 <p dir="ltr" style="font-family: Calibri;font-size: 11pt;line-height: 115.0%;margin-bottom: 0;margin-left: 0;margin-right: 0;margin-top: 0;">
 <span xml:space="preserve" style="font-size: 11pt;font-style: normal;font-weight: normal;margin: 0;padding: 0;"> </span>
 </p>

This, as you can see, is fine if you want a direct copy, but not if you want to control the style yourself.

In the C# code I have already made the following adjustments:

Many thanks.

Upvotes: 1

Views: 7795

Answers (2)

TinMan7757

Reputation: 148

I have solved this with a hint from Xiaoy312.

Using the example above, the resulting HTML string can be loaded into the Html Agility Pack like so:

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlString);

Then look for the attributes you want to drop (style and any others) and remove them.

var styles = htmlDoc.DocumentNode.SelectNodes("//@style");
if (styles != null)
{
    foreach (var item in styles)
    {
        item.Attributes["style"].Remove();
    }
}

Then save the file.

var fileName = Path.Combine(outputDirectory, "index.html");
// Save(string) opens and closes the file itself, so no FileStream is left undisposed.
htmlDoc.Save(fileName);

There will be other ways of doing this, but this seems like an acceptable workaround.
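One such alternative, sketched here under assumptions: in the linked HtmlConverter01 example, `HtmlConverter.ConvertToHtml` returns an `XElement`, so the style attributes could be removed with LINQ to XML before the markup is ever turned into a string, with no extra HTML parser involved. This is a different technique from the Html Agility Pack code above, and `StyleStripper` and `RemoveStyleAttributes` are hypothetical names made up for illustration.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class StyleStripper
{
    // Hypothetical helper: removes every "style" attribute from the element
    // and all of its descendants, then returns the same (mutated) element.
    public static XElement RemoveStyleAttributes(XElement html)
    {
        // Extensions.Remove() snapshots the attribute sequence internally,
        // so removing while enumerating is safe here.
        html.DescendantsAndSelf()
            .Attributes("style")
            .Remove();
        return html;
    }
}
```

Called on the element produced by the converter (or on any well-formed fragment loaded with `XElement.Parse`), this leaves the tags, their other attributes, and the text content untouched.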

EDIT:

After some experimenting with both answers posted here, I found that the following implementation works best, as it does not have an issue with images.

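var body = htmlDoc.DocumentNode.SelectSingleNode("//body");
// ".//*" is relative to body; a bare "//*" would search the whole document.
var tags = body?.SelectNodes(".//*");
if (tags != null)
{
    foreach (var tag in tags)
    {
        // Leave <img> elements untouched so they keep their src and other attributes.
        if (tag.Name != "img")
        {
            tag.Attributes.RemoveAll();
        }
    }
}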

In theory you can also use this for tables. Depending on the styling you want, you can strip out the attributes generated by Power Tools and replace them with your own.
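As a rough sketch of that last idea, the generated attributes on each table could be cleared and a single class attribute put in their place. The helper name `TableRestyler.ApplyTableClass` is hypothetical, and the default class `table` is only an assumption that Bootstrap's table styling is the one you want.

```csharp
using HtmlAgilityPack;

public static class TableRestyler
{
    // Hypothetical helper: strips all generated attributes from every <table>
    // and applies one class attribute instead.
    public static string ApplyTableClass(string html, string cssClass = "table")
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var tables = doc.DocumentNode.SelectNodes("//table");
        if (tables != null)
        {
            foreach (var table in tables)
            {
                table.Attributes.RemoveAll();
                table.SetAttributeValue("class", cssClass);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Rows and cells are left alone here; they could be handled the same way by selecting `//tr` or `//td`.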

Upvotes: 0

Xiaoy312

Reputation: 14477

You can also use XmlReader and XmlWriter to obtain bare-bones HTML. This could, however, be a little overkill, as only the tags themselves and their text content will be kept.

using System.IO;
using System.Text;
using System.Xml;

public static class HtmlHelper
{
    /// <summary>
    /// Keep only the opening and closing tags and the text content of the HTML.
    /// The input must be well-formed XML (XHTML), otherwise XmlReader throws.
    /// </summary>
    public static string CleanUp(string html)
    {
        var output = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(html)))
        {
            var settings = new XmlWriterSettings() { Indent = true, OmitXmlDeclaration = true };
            using (var writer = XmlWriter.Create(output, settings))
            {
                while (reader.Read())
                {
                    switch (reader.NodeType)
                    {
                        case XmlNodeType.Element:
                            writer.WriteStartElement(reader.Name);
                            // Self-closing elements (e.g. <br />) produce no EndElement node,
                            // so close them here to keep the output balanced.
                            if (reader.IsEmptyElement)
                            {
                                writer.WriteEndElement();
                            }
                            break;
                        case XmlNodeType.Text:
                            writer.WriteString(reader.Value);
                            break;
                        case XmlNodeType.EndElement:
                            writer.WriteFullEndElement();
                            break;
                    }
                }
            }
        }

        return output.ToString();
    }
}

Resulting output :

<p>
  <span></span>
</p>

Upvotes: 2
