Reputation: 4077
I have successfully read .docx files with the ElementTree package, opening them through zipfile. But I realized that .doc files have no 'word/document.xml' archive member. I looked into the docs but did not find anything about this. How can a .doc file be read?
For .docx, I used:
import zipfile as zf
import xml.etree.ElementTree as ET
z = zf.ZipFile("test.docx")
doc_xml = z.open('word/document.xml')
tree = ET.parse(doc_xml)
Using the above on a .doc file gives:
KeyError: "There is no item named 'word/document.xml' in the archive"
I saw a way to read files in the ElementTree docs, but it only works for XML files:
doc_xml = open('yesblue.doc','r')
How should I go about this one? Maybe something like converting .doc into .docx in Python itself.
Edit: The .doc format stores data in binary and XML cannot be used for it.
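Since a .docx file is a ZIP archive and a legacy .doc file is not, the two can be told apart before any parsing is attempted. Below is a minimal sketch of such a check; the helper name looks_like_docx is hypothetical and not part of any library, and it only checks for the 'word/document.xml' member used above.

```python
import zipfile


def looks_like_docx(path):
    """Return True if path is a ZIP archive that contains word/document.xml.

    Hypothetical helper for illustration only.
    """
    if not zipfile.is_zipfile(path):
        # Legacy binary .doc files are not ZIP archives.
        return False
    with zipfile.ZipFile(path) as z:
        return 'word/document.xml' in z.namelist()
```

A file for which this returns False would need to be converted to .docx first, before the zipfile/ElementTree approach can be applied.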
Upvotes: 2
Views: 4903
Reputation: 4077
After some serious searching, I realized that it would be better to use the comtypes package to convert the file from .doc to .docx format. This has its own disadvantages: it only works on Windows and requires Microsoft Office to be installed.
import os
import comtypes.client
in_file = os.path.abspath('yesblue.doc')  # input file; pass Word an absolute path
out_file = os.path.abspath('yesblue')  # name of output file in the current working directory
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=16)  # 16 = wdFormatDocumentDefault, i.e. .docx
doc.Close()
word.Quit()
The full list of FileFormat codes is given by Word's WdSaveFormat enumeration.
The output .docx file can be used for further processing in ElementTree.
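As a rough illustration of that further processing, the sketch below pulls the text of each paragraph out of 'word/document.xml'. The function name docx_paragraphs is hypothetical; the code assumes the standard WordprocessingML namespace and simply concatenates the w:t runs inside each w:p element, so formatting, tables and nested paragraphs (for example in text boxes) are not handled specially.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace used by word/document.xml.
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'


def docx_paragraphs(path):
    """Return the text of every w:p element in a .docx file.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(path) as z:
        with z.open('word/document.xml') as doc_xml:
            root = ET.parse(doc_xml).getroot()
    paragraphs = []
    for p in root.iter('{%s}p' % W_NS):
        # Join the text runs (w:t) that belong to this paragraph.
        text = ''.join(t.text or '' for t in p.iter('{%s}t' % W_NS))
        paragraphs.append(text)
    return paragraphs
```

For the converted file from the snippet above, this could be called as docx_paragraphs('yesblue.docx').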
Upvotes: 4