Reputation: 57
How would I extract metadata (e.g. FileSize, FileModifyDate, FileAccessDate) from a docx file?
Upvotes: 3
Views: 8530
Reputation: 476
Here's a reusable and concise method using the solutions above.
import os
from typing import Any, Dict

import docx
from docx.document import Document
from docx.opc.coreprops import CoreProperties

def get_docx_metadata(docpath: str) -> Dict[str, Any]:
    filename = os.path.basename(docpath)
    doc: Document = docx.Document(docpath)
    props: CoreProperties = doc.core_properties
    # Collect every public attribute of the core properties object.
    metadata = {str(p): getattr(props, p) for p in dir(props) if not str(p).startswith('_')}
    metadata['filepath'] = docpath
    metadata['filename'] = filename
    return metadata
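Note that the FileSize, FileModifyDate and FileAccessDate fields mentioned in the question are not part of core_properties; they are reported by the filesystem rather than stored in the document. A minimal sketch of reading them with os.stat, assuming that is acceptable for your use case (the helper name get_file_stats and the dictionary keys are my own, not from python-docx):

```python
import os
from datetime import datetime, timezone
from typing import Any, Dict


def get_file_stats(path: str) -> Dict[str, Any]:
    """Return the size and timestamps the filesystem reports for `path`."""
    st = os.stat(path)
    return {
        "file_size": st.st_size,  # size in bytes
        # Timestamps are converted to timezone-aware UTC datetimes.
        "file_modify_date": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        "file_access_date": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
    }
```

The result could be merged into the dictionary above with metadata.update(get_file_stats(docpath)). Access times depend on how the filesystem is mounted, so treat file_access_date as approximate.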
Upvotes: 1
Reputation: 913
Same solution as the previous answer, just with a little less typing.
import os
import docx

path = r'\Your\Path'  # raw string so the backslashes are not read as escapes
os.chdir(path)
fname = 'your.docx'
doc = docx.Document(fname)
prop = doc.core_properties
metadata = {}
for d in dir(prop):
    if not d.startswith('_'):
        metadata[d] = getattr(prop, d)
print(metadata)
Upvotes: 3
Reputation: 160
You could use python-docx. A python-docx Document object has a core_properties attribute you can utilise. It exposes 15 metadata attributes such as author, category, etc.
See the below code to extract some of the metadata into a python dictionary:
import docx

def getMetaData(doc):
    metadata = {}
    prop = doc.core_properties
    metadata["author"] = prop.author
    metadata["category"] = prop.category
    metadata["comments"] = prop.comments
    metadata["content_status"] = prop.content_status
    metadata["created"] = prop.created
    metadata["identifier"] = prop.identifier
    metadata["keywords"] = prop.keywords
    metadata["last_modified_by"] = prop.last_modified_by
    metadata["language"] = prop.language
    metadata["modified"] = prop.modified
    metadata["subject"] = prop.subject
    metadata["title"] = prop.title
    metadata["version"] = prop.version
    return metadata

file_path = 'your.docx'  # placeholder: replace with the path to your .docx file
doc = docx.Document(file_path)
metadata_dict = getMetaData(doc)
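If you would rather not install python-docx, the same core properties can usually be read with the standard library, since a .docx file is a ZIP archive. This is only a rough sketch: it assumes the file keeps its core properties at the usual docProps/core.xml path (it raises KeyError if that part is missing), and the function name read_core_xml is my own. The keys are the raw XML element names (e.g. creator, lastModifiedBy), not the python-docx attribute names.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict


def read_core_xml(docx_path: str) -> Dict[str, str]:
    """Map each element name in docProps/core.xml to its text content."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    # Tags look like "{namespace}creator"; keep only the part after "}".
    return {child.tag.split("}", 1)[-1]: (child.text or "") for child in root}
```

Values come back as plain strings, so dates such as created and modified are not parsed into datetime objects the way python-docx does.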
Upvotes: 5