yoshiserry
yoshiserry

Reputation: 21375

interpreting xml produced by microsoft office word and excel documents

I wish to build a free system to extract the text, text formatting (e.g bold etc) and images contents inside documents such as excel and word.

In my research I have found that the structure of excel (xlsx) and word (docx) documents is defined in xml once you extract the document with a compression utility like 7zip.

I am skilled in VBA, however I have not been able to find an object model (listing ALL objects and methods that can be applied /manipulated for any of:

  1. Excel VBA
  2. Word VBA
  3. Word XML
  4. Excel XML

I know many excel vba objects already however that is just through trial and error and experimentation, and not through reading an object model where the methods/objects are defined!

The problem

I am trying to develop a tool which looks through the xml to find:

  1. The location of any images in the document, both the relative directory (in the Directory / Word / Media folder) and the actual file path, e.g. C:\documents\josh\img1.png
  2. The position of any text in the document (I'm thinking in terms of lines, reading a document from top to bottom, as well as alignment like centre etc) SO I can reproduce the text in the right order.
  3. The formatting applied to the text (bold, some font, some size?

Please help me find an object model or some way to interpret or parse this

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14"><w:body><w:p w:rsidR="001920B6" w:rsidRDefault="001920B6" w:rsidP="001920B6"><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/><w:r><w:rPr><w:noProof/></w:rPr><w:drawing><wp:anchor distT="0" distB="0" distL="114300" distR="114300" simplePos="0" relativeHeight="251658240" behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1" wp14:anchorId="4B104522" wp14:editId="4A3907E9"><wp:simplePos x="0" y="0"/><wp:positionH relativeFrom="column"><wp:posOffset>0</wp:posOffset></wp:positionH><wp:positionV relativeFrom="paragraph"><wp:posOffset>1209675</wp:posOffset></wp:positionV><wp:extent cx="5943600" cy="3343275"/><wp:effectExtent l="0" t="0" r="0" b="9525"/><wp:wrapTight wrapText="bothSides"><wp:wrapPolygon edited="0"><wp:start x="0" y="0"/><wp:lineTo x="0" y="21538"/><wp:lineTo x="21531" y="21538"/><wp:lineTo x="21531" y="0"/><wp:lineTo x="0" y="0"/></wp:wrapPolygon></wp:wrapTight><wp:docPr id="1" name="Picture 1"/><wp:cNvGraphicFramePr><a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" noChangeAspect="1"/></wp:cNvGraphicFramePr><a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"><a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture"><pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"><pic:nvPicPr><pic:cNvPr id="0" name="windows.png"/><pic:cNvPicPr/></pic:nvPicPr><pic:blipFill><a:blip r:embed="rId7" cstate="print"><a:extLst><a:ext uri="{28A0092B-C50C-407E-A947-70E740481C1C}"><a14:useLocalDpi xmlns:a14="http://schemas.microsoft.com/office/drawing/2010/main" val="0"/></a:ext></a:extLst></a:blip><a:stretch><a:fillRect/></a:stretch></pic:blipFill><pic:spPr><a:xfrm><a:off x="0" y="0"/><a:ext cx="5943600" cy="3343275"/></a:xfrm><a:prstGeom prst="rect"><a:avLst/></a:prstGeom></pic:spPr></pic:pic></a:graphicData></a:graphic><wp14:sizeRelH relativeFrom="page"><wp14:pctWidth>0</wp14:pctWidth></wp14:sizeRelH><wp14:sizeRelV relativeFrom="page"><wp14:pctHeight>0</wp14:pctHeight></wp14:sizeRelV></wp:anchor></w:drawing></w:r><w:r w:rsidR="00327DB9"><w:rPr><w:noProof/></w:rPr><w:t>Plain text</w:t></w:r></w:p><w:p w:rsidR="00327DB9" w:rsidRDefault="00327DB9" w:rsidP="001920B6"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r><w:r w:rsidR="0009704D" w:rsidRPr="00327DB9"><w:rPr><w:b/></w:rPr><w:t xml:space="preserve"> text</w:t></w:r></w:p><w:p w:rsidR="00327DB9" w:rsidRPr="00327DB9" w:rsidRDefault="00327DB9" w:rsidP="00327DB9"><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>heading</w:t></w:r></w:p><w:p w:rsidR="00327DB9" w:rsidRPr="001920B6" w:rsidRDefault="00327DB9"/><w:sectPr w:rsidR="00327DB9" w:rsidRPr="001920B6"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/><w:cols w:space="720"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>

Questions about the xml

  1. Determining the position of the image (which is at the bottom) relative to the text (which is above it)
  2. How many images there are? Is there one because the picture has an ID or INDEX of 0?

Upvotes: 0

Views: 929

Answers (1)

Christian Sauer
Christian Sauer

Reputation: 10949

Take a look at Office Open XML, that's the xml-structure of all MS-Office documents: http://openxmldeveloper.org/. There is a rahter good ebook which explains the basics: http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2007/08/13/1970.aspx

But bewarE: parsing or interpreting Office Open XML is a extremely huge task, especially in VBA which is ill-suited for this job. There are numerous libraries in C# / VB.net which can read office open xml documents, which would be a better starting point.

Upvotes: 1

Related Questions