Reputation: 1081
I'm trying to build a JavaScript parser for .ppt files. PPTX is no big deal since it's an "open" format, but I'm really lost regarding the file structure of a .ppt file and can't find any useful information.
Has anyone ever tried this, or can anyone at least point me to the 'spec' for .ppt, so I can build the parser?
Best Regards, Celso Santos
Upvotes: 6
Views: 4886
Reputation: 7982
First, in case anyone doesn't know, all Office "X" files (pptx, xlsx, docx) are just zip files! If you rename them to .zip, you can open them in any zip explorer (including Windows 10/11 directly). They contain all the embedded image/sound/xml/etc. files your document uses! Just edit them within the zip and save, and Office can't even tell you've edited them.
OK, with that out of the way: Microsoft Office files from before the "X" formats, including PowerPoint files (before pptx), are all in "CFB" (Compound File Binary) format. CFB is a custom Microsoft container and is not zip; only the newer "X" formats moved to standard zip.
Here is the full spec for PowerPoint files with a "PPT" extension: https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-ppt/6be79dde-33c1-4c1b-8ccc-4b2301c08662?redirectedfrom=MSDN
Here is the full spec for the CFB format. Version 3 is what PPT files typically use: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-cfb/53989ce4-7b05-4f8d-829b-d08d6148375b
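To make the zip-versus-CFB distinction concrete, here is a minimal sketch that sniffs the first bytes of a file. The function names (`startsWith`, `detectContainer`) are made up for illustration; the magic numbers are the ZIP local file header signature and the 8-byte CFB header signature from the CFB spec.

```javascript
// Magic numbers: ZIP local file header "PK\x03\x04" and the CFB header signature.
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04];
const CFB_MAGIC = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];

// True if `bytes` (Uint8Array or Node Buffer) begins with the given magic bytes.
function startsWith(bytes, magic) {
  if (bytes.length < magic.length) return false;
  return magic.every((b, i) => bytes[i] === b);
}

// Hypothetical helper: returns "cfb" (ppt/doc/xls/msg), "zip" (pptx/docx/xlsx), or "unknown".
function detectContainer(bytes) {
  if (startsWith(bytes, CFB_MAGIC)) return "cfb";
  if (startsWith(bytes, ZIP_MAGIC)) return "zip";
  return "unknown";
}
```

In Node you could pass it the Buffer returned by `fs.readFileSync("test.ppt")`, since a Buffer is a Uint8Array. This only identifies the container; it does not prove the file inside is a valid presentation.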
PPT or PPTX?
In Microsoft Office, the formats with an "X" at the end of the file extension are not CFB at all: they are zip packages standardized as Office Open XML, and their specification is public. Microsoft decided to publish the CFB specification as well. CFB is the container used by all Office files without the "x" extension, i.e. ppt, doc, xls, msg. These are typically CFB version 3, which uses 512-byte sectors; version 4 uses 4096-byte sectors.
Versions 1 and 2 are deprecated; Microsoft has never published their format and does not intend to. It was essentially a custom/primitive zip format.
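If you want to see which CFB version a given file actually uses, the header says so. A rough sketch, assuming the field offsets from the CFB spec's header layout (minor version at 0x18, major version at 0x1A, byte order at 0x1C, sector shift at 0x1E, first directory sector at 0x30); `readCfbHeader` is a made-up name and it assumes the signature has already been checked.

```javascript
// Hypothetical helper: reads a few fields from the 512-byte CFB header.
// `bytes` is a Uint8Array (or Node Buffer) holding at least the start of the file.
function readCfbHeader(bytes) {
  if (bytes.length < 512) throw new Error("too short for a CFB header");
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

  const minorVersion = view.getUint16(0x18, true); // all fields are little-endian
  const majorVersion = view.getUint16(0x1a, true); // 3 or 4
  const byteOrder = view.getUint16(0x1c, true);    // stored as FE FF on disk
  const sectorShift = view.getUint16(0x1e, true);  // 9 for v3, 12 for v4

  if (byteOrder !== 0xfffe) throw new Error("unexpected byte-order mark");

  return {
    majorVersion,
    minorVersion,
    sectorSize: 1 << sectorShift, // 512 for v3, 4096 for v4
    firstDirectorySector: view.getUint32(0x30, true),
  };
}
```

For a typical .ppt you would expect `majorVersion` 3 and `sectorSize` 512, but check rather than assume.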
What is CFB/OLE/COM?
CFB is also known as the OLE (Object Linking and Embedding) compound file format or COM (Component Object Model) structured storage. You may have seen the terms "OLE" or "COM" if you ever wrote a Windows app or paid attention to the install messages while installing Windows 3.x or 9x. Microsoft also did a lot of advertising in the early 90s. This is the same thing PPT files use! :)
How does CFB work?
CFB has changed a lot since v3/PPT, but 3.0 generally works like this:
If you're familiar with the FAT file system, you'll be right at home with CFB and the PPT file format. They are very similar in many ways.
This is how PPT and other office formats can store multiple things within them such as multiple pictures, charts, etc.; they simply store multiple files along with their main file.
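The FAT similarity can be sketched like this: the file is cut into fixed-size sectors, and a table tells you which sector comes next for each sector. The helper names below are made up, and the sketch assumes you have already loaded the FAT into an array of uint32 values; the `ENDOFCHAIN` marker value (0xFFFFFFFE) and the "header occupies the slot before sector 0" offset rule are from the CFB spec.

```javascript
// Marker that terminates a sector chain in the FAT.
const ENDOFCHAIN = 0xfffffffe;

// Byte offset of a sector in the file: the header takes up the slot before sector 0.
function sectorOffset(sector, sectorSize) {
  return (sector + 1) * sectorSize;
}

// Hypothetical helper: `fat[n]` is the sector that follows sector n.
// Returns the ordered list of sectors that make up one stream.
function followChain(fat, start) {
  const chain = [];
  const seen = new Set();
  let sector = start;
  while (sector !== ENDOFCHAIN) {
    // Guard against out-of-range entries and loops in damaged files.
    if (sector >= fat.length || seen.has(sector)) {
      throw new Error("corrupt FAT chain");
    }
    seen.add(sector);
    chain.push(sector);
    sector = fat[sector];
  }
  return chain;
}
```

Concatenating the bytes at `sectorOffset(s, sectorSize)` for each sector in the chain, then trimming to the stream's size, gives you the stream contents. Small streams live in a separate "mini FAT", which this sketch does not cover.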
All CFB files must contain one root entry. That entry can then have children, and the children can have children, and so on, much like folders ("storages" in the spec) and files ("streams"). In a PPT, the presentation itself lives in a stream named "PowerPoint Document". Other streams may hold pictures or other embeds.
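Each of those files and children is described by a 128-byte directory entry. Here is a hedged sketch of decoding one, assuming the field offsets from the CFB spec (UTF-16LE name in the first 64 bytes, name length at 0x40, type at 0x42, sibling/child IDs at 0x44/0x48/0x4C, start sector at 0x74, size at 0x78). `readDirectoryEntry` and `ENTRY_TYPES` are illustrative names, not part of any library.

```javascript
// Object type byte -> readable label.
const ENTRY_TYPES = { 0: "unused", 1: "storage", 2: "stream", 5: "root" };

// Hypothetical helper: decodes the 128-byte directory entry starting at `offset`.
function readDirectoryEntry(bytes, offset = 0) {
  const view = new DataView(bytes.buffer, bytes.byteOffset + offset, 128);

  // Name length is in bytes and includes the 2-byte terminating null.
  const nameBytes = Math.min(view.getUint16(0x40, true), 64);
  let name = "";
  for (let i = 0; i + 2 < nameBytes; i += 2) {
    name += String.fromCharCode(view.getUint16(i, true));
  }

  const typeByte = view.getUint8(0x42);
  return {
    name,
    type: ENTRY_TYPES[typeByte] || "invalid",
    leftSibling: view.getUint32(0x44, true),  // 0xFFFFFFFF means "none"
    rightSibling: view.getUint32(0x48, true),
    child: view.getUint32(0x4c, true),        // first child of a storage/root
    startSector: view.getUint32(0x74, true),  // where the stream's chain begins
    size: view.getUint32(0x78, true),         // low 32 bits; enough for version 3 files
  };
}
```

Siblings form a tree keyed by name, so to list everything you start at entry 0 (the root), go to its `child`, and walk `leftSibling`/`rightSibling` from there.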
If you just want to read a PPT
The npm package ppt (https://www.npmjs.com/package/ppt) can read the ppt format and output the text any PowerPoint contains. Use it like this from the command prompt:
Install...
npm i ppt
Usage...
ppt test.ppt
(will return the text of the entire presentation)
View the source code here: https://github.com/SheetJS/js-ppt
If you want to extract all the files (like images) contained within a PPT or other office file (like DOC/XLS/MSG)
https://www.npmjs.com/package/compound-binary-file-js
Upvotes: 4
Reputation: 16472
.ppt is a binary file format. You can read the 1997-2007 spec here
Not to discourage you from trying, but you should note that this may wind up being a daunting, almost impossible task for one developer, since the entire spec represents thousands of programming hours over 10 years.
Joel Spolsky has a good article on dealing with these file formats.
Just for completeness' sake, here is the spec for the pptx file format.
Upvotes: 11