Hypothetical Ninja

Reputation: 4077

Reading a .doc file with ElementTree

I have successfully read .docx files using the ElementTree package together with zipfile. But I realized that .doc files do not contain the archive member 'word/document.xml'. I looked into the docs but did not find anything. How can such a file be read? For .docx, I used:

import zipfile as zf
import xml.etree.ElementTree as ET
z = zf.ZipFile("test.docx")
doc_xml = z.open('word/document.xml')
tree = ET.parse(doc_xml)

Using the above for a .doc file gives:

KeyError: "There is no item named 'word/document.xml' in the archive"

I saw a way to read files in the ElementTree docs, but it works for XML files only.

doc_xml = open('yesblue.doc','r')  

How should I go about this one? Maybe something like converting .doc into .docx within Python itself.

Edit: The .doc format stores its data in binary form, so it cannot be parsed as XML.
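The difference can be checked before any parsing is attempted. This is a minimal sketch, assuming the legacy .doc is an OLE2 compound file (which starts with the signature bytes D0 CF 11 E0 A1 B1 1A E1) and that a .docx is a ZIP archive containing 'word/document.xml'. The helper name `sniff_word_format` is hypothetical and not part of any library:

```python
import zipfile

# Signature at the start of OLE2 compound files (assumed container for legacy .doc).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

def sniff_word_format(path):
    """Return 'docx', 'doc', or 'unknown' based on file contents, not the extension."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as z:
            # A .docx keeps its main body in this archive member.
            if "word/document.xml" in z.namelist():
                return "docx"
        return "unknown"
    with open(path, "rb") as f:
        if f.read(8) == OLE2_MAGIC:
            return "doc"
    return "unknown"
```

Calling `sniff_word_format('yesblue.doc')` before opening the file with zipfile avoids the KeyError above and makes it explicit when a conversion step is needed.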

Upvotes: 2

Views: 4903

Answers (1)

Hypothetical Ninja

Reputation: 4077

After some serious searching, I realized that it would be better to use the comtypes package to convert the file from .doc to .docx format. This has its own disadvantages: it works only on Windows and requires Microsoft Office to be installed.

import os
import comtypes.client

# Use absolute paths: Word resolves relative paths against its own working directory.
in_file = os.path.abspath('yesblue.doc')  # name of input file
out_file = os.path.abspath('yesblue')     # name of output file, placed in the current working directory
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=16)  # 16 = wdFormatDocumentDefault, i.e. the default .docx format
doc.Close()
word.Quit()

The list of format codes is given in Word's WdSaveFormat enumeration.

The output docx file can be used for further processing in ElementTree.
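As an illustration of that further processing, here is a rough sketch that pulls the paragraph text out of the converted file. It assumes the standard WordprocessingML namespace for the `w:p` and `w:t` elements; the function name `docx_paragraphs` is hypothetical:

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_paragraphs(path):
    """Return the text of each w:p paragraph in a .docx file as a list of strings."""
    with zipfile.ZipFile(path) as z:
        with z.open("word/document.xml") as doc_xml:
            root = ET.parse(doc_xml).getroot()
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join every w:t inside it.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

For the file produced above, something like `docx_paragraphs('yesblue.docx')` should return one string per paragraph, with empty strings for blank paragraphs.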

Upvotes: 4
