Reputation: 2383
I'm trying to create a parser to find the tracked changes and their authors in a Word .docx file. I found document.xml, but there are so many tags! Is there a glossary somewhere of what all those tags stand for?
I'd like to avoid brute-forcing my way through this if possible.
Upvotes: 14
Views: 9907
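Reputation: 2383
"w:ins" marks content that was inserted while tracked changes were enabled.
"w:del" marks content that was deleted while tracked changes were enabled.
"w:commentRangeStart" denotes the start of a comment.
"w:commentRangeEnd" denotes the end of the comment.
Both "w:ins" and "w:del" carry the author's name in their "w:author" attribute.
Regular text is found inside "w:t" tags; deleted text inside "w:del" is stored in "w:delText" tags instead.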
Upvotes: 2
Reputation: 15863
You can use my docx4j webapp, specifically http://webapp.docx4java.org/OnlineDemo/PartsList.html
With that you can click on a tag and it will take you to the corresponding definition in the spec.
Upvotes: 1
Reputation: 2490
The "Office Open XML" format and its XML vocabularies are described in detail in ECMA-376: http://www.ecma-international.org/publications/standards/Ecma-376.htm
To give you an idea, the following piece of XSLT should extract just the effective result text, without tracked deletions, from a WordprocessingML document such as the one stored under word/document.xml in a .docx file (a ZIP archive).
<!-- Match and output text spans except when
     appearing in w:delText child content -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <xsl:output method="text"/>
  <xsl:template match="w:t">
    <xsl:value-of select="."/>
  </xsl:template>
  <xsl:template match="w:delText"/>
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
To extract the changes themselves instead, your application would also have to handle w:ins elements.
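If you would rather not use XSLT, roughly the same idea can be sketched in Python with only the standard library. This is an approximation under a few assumptions: the function names read_document_xml, effective_text and make_minimal_docx are invented for this example, and the archive is assumed to contain word/document.xml. Like the stylesheet, it keeps w:t text and drops w:delText, and it emits no paragraph breaks.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace that the "w:" prefix is bound to in document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical fragment with one tracked deletion, used only for the demo.
SAMPLE_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p>'
    '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
    '<w:del w:id="1" w:author="Bob" w:date="2020-01-02T00:00:00Z">'
    '<w:r><w:delText>old </w:delText></w:r></w:del>'
    '<w:r><w:t>world</w:t></w:r>'
    '</w:p></w:body></w:document>'
)


def read_document_xml(docx):
    """Return the bytes of word/document.xml; docx is a path or file-like object."""
    with zipfile.ZipFile(docx) as archive:
        return archive.read("word/document.xml")


def effective_text(document_xml):
    """Join all w:t text. w:delText is a different tag, so deletions drop out."""
    root = ET.fromstring(document_xml)
    return "".join(t.text or "" for t in root.iter(W + "t"))


def make_minimal_docx(document_xml):
    """Demo helper: build an in-memory ZIP holding only word/document.xml.

    This is not a valid Word file; it only stands in for a real .docx here.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    docx = make_minimal_docx(SAMPLE_XML)
    print(effective_text(read_document_xml(docx)))
```

With a real file you would pass its path to read_document_xml instead of the in-memory buffer.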
Upvotes: 1
Reputation: 23149
You can start gathering information in the Stack Overflow docx tag wiki itself.
.docx files (as well as other newer MS Office files like .xlsx) use the OOXML format.
In particular:
Microsoft Office Open XML WordprocessingML is mostly standardized in ECMA-376 and ISO/IEC 29500.
You can get the relevant ECMA standard specification here: http://www.ecma-international.org/news/TC45_current_work/TC45_available_docs.htm
The specific document you are probably looking for is Office Open XML, Part 4: Markup Language Reference.
But of course, this is huge (5219 pages!).
I strongly recommend pinpointing the functionality you want and having a look at existing open source libraries that already do part of the job.
Upvotes: 4