fifamaniac04

Reputation: 2383

Is there a glossary of Word .docx XML tags?

I'm trying to write a parser that finds the tracked changes in a Word .docx file, along with the author of each change.

I found document.xml, but there are so many tags! Is there a glossary somewhere that explains what all those tags stand for?

I'd like to avoid brute forcing my way through this if possible.

Upvotes: 14

Views: 9907

Answers (4)

fifamaniac04

Reputation: 2383

"w:ins" marks content that was inserted while tracked changes were enabled.
"w:del" marks content that was deleted while tracked changes were enabled.
"w:commentRangeStart" marks the start of a commented range.
"w:commentRangeEnd" marks the end of that range.

Visible text is found inside "w:t" tags; text removed by a tracked deletion is stored in "w:delText" tags instead.

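As a rough illustration of how these tags could be used for the original goal, here is a minimal Python sketch that lists insertions and deletions together with their author. It relies on the w:author and w:date attributes that WordprocessingML puts on w:ins and w:del. The function name tracked_changes and the SAMPLE document are made up for this example, and the sketch ignores other revision markup such as moves (w:moveFrom / w:moveTo) and formatting changes.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
INS, DEL = f"{{{W}}}ins", f"{{{W}}}del"
# Inserted text sits in w:t, deleted text in w:delText.
TEXT_TAG = {INS: f"{{{W}}}t", DEL: f"{{{W}}}delText"}


def tracked_changes(document_xml):
    """Return (kind, author, date, text) tuples in document order.

    Hypothetical helper: `document_xml` is the content of word/document.xml.
    """
    root = ET.fromstring(document_xml)
    changes = []
    for el in root.iter():
        if el.tag not in TEXT_TAG:
            continue
        text = "".join(t.text or "" for t in el.iter(TEXT_TAG[el.tag]))
        kind = "ins" if el.tag == INS else "del"
        changes.append(
            (kind, el.get(f"{{{W}}}author"), el.get(f"{{{W}}}date"), text)
        )
    return changes


# Tiny hand-written sample, not produced by Word.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
    '<w:ins w:id="1" w:author="Alice" w:date="2020-01-01T00:00:00Z">'
    '<w:r><w:t>brave </w:t></w:r></w:ins>'
    '<w:del w:id="2" w:author="Bob" w:date="2020-01-02T00:00:00Z">'
    '<w:r><w:delText>old </w:delText></w:r></w:del>'
    '<w:r><w:t>world</w:t></w:r>'
    '</w:p></w:body></w:document>'
)

if __name__ == "__main__":
    for change in tracked_changes(SAMPLE):
        print(change)
```

A w:ins can also appear without any text, for example when only a paragraph mark was inserted, in which case the text field comes back empty.
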
Upvotes: 2

JasonPlutext

Reputation: 15863

You can use my docx4j webapp, specifically http://webapp.docx4java.org/OnlineDemo/PartsList.html

With that you can click on a tag and it will take you to the corresponding definition in the spec.

Upvotes: 1

imhotap

Reputation: 2490

The "Office Open XML" format and its XML vocabularies are described in detail in the ECMA-376 standard: http://www.ecma-international.org/publications/standards/Ecma-376.htm

To give you an idea, the following piece of XSLT should extract just the effective text of a WordprocessingML document, i.e. the text without tracked deletions, as stored under word/document.xml in a .docx file (which is a ZIP archive).

<!-- Output the text of w:t elements; drop w:delText (tracked deletions)
     and any other stray text nodes -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <xsl:output method="text"/>
  <!-- Visible run text -->
  <xsl:template match="w:t">
    <xsl:value-of select="."/>
  </xsl:template>
  <!-- Text removed by a tracked deletion -->
  <xsl:template match="w:delText"/>
  <!-- Suppress text outside w:t, e.g. field codes in w:instrText -->
  <xsl:template match="text()"/>
  <!-- Recurse through every other element -->
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>

For your application to extract changes instead, you'd also have to take care of w:ins elements.
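If no XSLT processor is at hand, roughly the same extraction can be sketched in Python with the standard library, including the step of pulling word/document.xml out of the ZIP container. The names effective_text and demo_archive are invented for this example, and the demo archive contains only the one part needed here, so it is not a file Word itself would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def effective_text(docx_file):
    """Concatenate all w:t text from word/document.xml.

    `docx_file` may be a path or a binary file-like object. Deleted text
    lives in w:delText, so selecting only w:t leaves it out; inserted text
    sits in ordinary w:t runs inside w:ins and is therefore included.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return "".join(t.text or "" for t in root.iter(f"{{{W}}}t"))


def demo_archive():
    """Build a minimal in-memory ZIP holding only word/document.xml."""
    xml = (
        f'<w:document xmlns:w="{W}"><w:body><w:p>'
        '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
        '<w:ins w:id="1" w:author="Alice"><w:r><w:t>brave </w:t></w:r></w:ins>'
        '<w:del w:id="2" w:author="Bob">'
        '<w:r><w:delText>old </w:delText></w:r></w:del>'
        '<w:r><w:t>world</w:t></w:r>'
        '</w:p></w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(effective_text(demo_archive()))
```

Like the stylesheet, this emits no paragraph breaks; joining the text per w:p element would be one way to add them.
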

Upvotes: 1

Pac0

Reputation: 23149

You can start gathering information about it in the Stack Overflow docx tag wiki itself.

.docx files (as well as other new MS Office files like .xlsx) use the OOXML format.


In particular:

Microsoft Office Open XML WordProcessingML is mostly standardized in ECMA 376 and ISO 29500.

You can get the relevant ECMA standard specification here: http://www.ecma-international.org/news/TC45_current_work/TC45_available_docs.htm

The specific document you are probably looking for is Office Open XML, Part 4: Markup Language Reference.

But of course... this is huge (5219 pages!)

I strongly recommend pinpointing the functionality you need and having a look at existing open source libraries that already do part of the job.

Upvotes: 4
