anshabhi

Reputation: 423

What makes Microsoft-Word-generated HTML documents so large in code?

Below is simple, W3C-validated code that prints "Hello World":

<!DOCTYPE html>
<html>
<head>
<meta charset = "utf-8">
<title>Hello</title>
</head>
Hello World
</html> 

But when I do the same thing with MS Word, the generated code is 449 lines long. Why do all these extra lines appear in the code?

Upvotes: 3

Views: 2043

Answers (3)

Stéphane GRILLON

Reputation: 11862

Word declares its own XML namespaces:

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">

Word keeps the document metadata:

<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>xxxxxx</o:Author>
  <o:LastAuthor>xxxxx</o:LastAuthor>
  <o:Revision>2</o:Revision>
  <o:TotalTime>0</o:TotalTime>
  <o:Created>2015-05-25T11:40:00Z</o:Created>
  <o:LastSaved>2015-05-25T11:40:00Z</o:LastSaved>
  <o:Pages>1</o:Pages>
  <o:Words>1</o:Words>
  <o:Characters>11</o:Characters>
  <o:Company>Sopra Group</o:Company>
  <o:Lines>1</o:Lines>
  <o:Paragraphs>1</o:Paragraphs>
  <o:CharactersWithSpaces>11</o:CharactersWithSpaces>
  <o:Version>12.00</o:Version>
 </o:DocumentProperties>
</xml><![endif]-->

Word adds CSS styles:

<style>
<!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;
    mso-font-charset:0;
    mso-generic-font-family:roman;
    mso-font-pitch:variable;
    mso-font-signature:-536870145 1107305727 0 0 415 0;}
@font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;
    mso-font-charset:0;
    mso-generic-font-family:swiss;
    mso-font-pitch:variable;
    mso-font-signature:-536870145 1073786111 1 0 415 0;}
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
    {mso-style-unhide:no;
    mso-style-qformat:yes;
    mso-style-parent:"";
    margin-top:0cm;
    margin-right:0cm;
    margin-bottom:10.0pt;
    margin-left:0cm;
    line-height:115%;
    mso-pagination:widow-orphan;
    font-size:11.0pt;
    font-family:"Calibri","sans-serif";
    mso-ascii-font-family:Calibri;
    mso-ascii-theme-font:minor-latin;
    mso-fareast-font-family:Calibri;
    mso-fareast-theme-font:minor-latin;
    mso-hansi-font-family:Calibri;
    mso-hansi-theme-font:minor-latin;
    mso-bidi-font-family:"Times New Roman";
    mso-bidi-theme-font:minor-bidi;
    mso-fareast-language:EN-US;}
.MsoChpDefault
    {mso-style-type:export-only;
    mso-default-props:yes; ......

Word then uses the CSS style:

<p class=MsoNormal>Hello World</p>

You need to keep this information if you want to edit the document in Word again later. If you are only doing a simple export, you can delete all of the metadata.
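For a one-off export, the cleanup can be scripted. The following is a minimal Python sketch, not a complete cleaner: the function name `strip_word_metadata` and the three regular expressions are my own illustrative choices, and it assumes the Office-only blocks always use the `<!--[if ...]> ... <![endif]-->` form shown above. It leaves the `xmlns:` attributes, the `MsoNormal` class names and any inline `style` attributes untouched.

```python
import re

# Office-only blocks such as <!--[if gte mso 9]> ... <![endif]-->
# (assumption: this is the only conditional-comment form in the file).
CONDITIONAL_COMMENT = re.compile(r"<!--\[if [^\]]*\]>.*?<!\[endif\]-->", re.DOTALL)

# A whole <style> element, split into opening tag, body and closing tag.
STYLE_BLOCK = re.compile(r"(<style\b[^>]*>)(.*?)(</style>)", re.DOTALL | re.IGNORECASE)

# One "mso-something: value;" declaration inside a style body.
MSO_DECLARATION = re.compile(r"\s*(?<![\w-])mso-[\w-]+\s*:[^;}]*;?")


def strip_word_metadata(html: str) -> str:
    """Drop Office conditional comments and mso-* declarations in <style> blocks."""
    html = CONDITIONAL_COMMENT.sub("", html)

    def clean_style(match: re.Match) -> str:
        # Only the CSS body is rewritten; the <style> tags are kept as they are.
        body = MSO_DECLARATION.sub("", match.group(2))
        return match.group(1) + body + match.group(3)

    return STYLE_BLOCK.sub(clean_style, html)
```

Run it on a copy of the file: once these blocks are gone, Word can no longer restore the document properties and Word-specific formatting from the HTML.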

Upvotes: 13

ShadowScripter

Reputation: 7369

As explained in this link, the code is added for MS Office's own purposes; among other things, it is meant to make it easier for you to resume editing the document in Word. Most of the bloat you're seeing is just layout and document information, I gather. I'll post the relevant quote for future reference in case of link rot.

[...] Turns out these HTML files were created by Microsoft Word! Due to a series of different web designs and designers over a number of years, as well as a healthy bit of editing by the marketing department, 1 in 4 web pages of our client’s current website were created or modified using Microsoft Word!

As we scrolled through the HTML file we saw large amounts of extra data that no normal web browser would ever interpret. A little research explained it for us. Microsoft allows you to save a document as an HTML file. They also want you to be able to open an HTML file that was created using Microsoft Office and resume editing it just like a normal document. Since Microsoft Office has all sorts of features that HTML and CSS don’t, this allows Office to preserve certain information inside the HTML file between edits.

Some of the data stored is obvious: when the document was created and by whom, who made what edits when, paragraph count, etc. Other less obvious data such as VML, DHTML behaviors, column and page spacing, Word styling information, embedded object data, and more is also stored inside the file. All of this Office-specific data is stored inside the HTML file and is wrapped inside special conditional comments such as <!--[if gte mso 9]>. This hides the content from other programs that read the HTML.

As Adriano Repetti pointed out, there's some code to handle older versions of Office.

<!--[if gte mso 9]> ...
<!--[if gte mso 10]> ...

These conditional comments check the MS Office version to determine the layout. I should probably mention that editing HTML in Word is not something I'd recommend. Ever.
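If you want to see which version checks a particular file contains, a short Python sketch like the one below can list them. The function name `office_conditions` is hypothetical, and the pattern assumes every check is written in the `<!--[if ...]>` form shown above.

```python
import re
from collections import Counter

# Captures the condition text between "<!--[if" and "]>", e.g. "gte mso 9".
CONDITION = re.compile(r"<!--\[if\s+([^\]]+)\]>")


def office_conditions(html: str) -> Counter:
    """Count how often each conditional-comment condition appears."""
    return Counter(condition.strip() for condition in CONDITION.findall(html))
```

Calling `office_conditions(open("hello.htm").read())` (the file name is only a placeholder) returns a `Counter` keyed by condition strings such as `gte mso 9`.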

Try out NetBeans, it's free and awesome :)
I sound like a car salesman... * grumbles *

Upvotes: 7

user4563161

Reputation:

The extra code you see consists of:

  1. The @font-face declarations for the fonts used.
  2. The "o:" information (Document Properties), which stores details such as the author, dates, word count, etc.
  3. Word document settings and math settings. These include things like kerning (the space between letters), the language the document is in, and a host of other settings generally related to page and content layout.

Ultimately, all of this affects what you see on the page, so that it looks similar to your Word document, and it retains background information such as the word count.
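To get a rough idea of how much of a saved file each kind of extra code takes up, something like the following Python sketch could be used. The function name `size_breakdown` and the bucket names are made up for illustration, and the split is coarse: it counts characters in Office conditional comments, then in the remaining `<style>` blocks, and treats whatever is left as everything else.

```python
import re

# Office-only blocks such as <!--[if gte mso 9]> ... <![endif]-->
CONDITIONAL_COMMENT = re.compile(r"<!--\[if [^\]]*\]>.*?<!\[endif\]-->", re.DOTALL)
# Whole <style> elements, including their tags.
STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style>", re.DOTALL | re.IGNORECASE)


def size_breakdown(html: str) -> dict:
    """Return character counts per rough category of Word-generated markup."""
    conditional = sum(len(m.group(0)) for m in CONDITIONAL_COMMENT.finditer(html))
    # Style blocks are measured after the conditional comments are removed,
    # so a <style> nested inside a conditional comment is not counted twice.
    remainder = CONDITIONAL_COMMENT.sub("", html)
    styles = sum(len(m.group(0)) for m in STYLE_BLOCK.finditer(remainder))
    return {
        "conditional_comments": conditional,
        "style_blocks": styles,
        "everything_else": len(html) - conditional - styles,
    }
```

The three numbers always add up to the length of the input, which makes it easy to see which part dominates.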

Upvotes: 1
