achen17

Reputation: 57

How to extract metadata from docx file using Python?

How would I extract metadata (e.g. FileSize, FileModifyDate, FileAccessDate) from a docx file?

Upvotes: 3

Views: 8530

Answers (3)

John Bonfardeci

Reputation: 476

Here's a reusable, concise function that builds on the other answers.

import os
from typing import Dict
import docx
from docx.document import Document
from docx.opc.coreprops import CoreProperties

def get_docx_metadata(docpath: str) -> Dict:
    filename = os.path.basename(docpath)
    doc: Document = docx.Document(docpath)
    props: CoreProperties = doc.core_properties
    # Copy every public attribute of the core properties into a dict
    metadata = {str(p): getattr(props, p) for p in dir(props) if not str(p).startswith('_')}
    metadata['filepath'] = docpath
    metadata['filename'] = filename
    return metadata

Upvotes: 1

Joe

Reputation: 913

Same solution as the previous answer, just with a little less typing.

import os
import docx

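path = r'\Your\Path'  # raw string so the backslashes are not read as escape sequences
os.chdir(path)

fname = 'your.docx'
doc = docx.Document(fname)

prop = doc.core_properties

# Collect every public attribute of the core properties object
metadata = {}
for d in dir(prop):
    if not d.startswith('_'):
        metadata[d] = getattr(prop, d)

print(metadata)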
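
If installing python-docx is not an option, roughly the same information can be read with the standard library alone, because a .docx file is a ZIP archive and the core properties are stored as XML inside it. The sketch below assumes the conventional part name docProps/core.xml that Word writes; strictly speaking the part is located through the package relationships, so other producers may place it elsewhere. The function name read_core_properties is made up for this example. Keys are the XML element names (creator, lastModifiedBy, created, ...) rather than the python-docx attribute names, and all values come back as plain strings.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_core_properties(docx_path):
    """Return the entries of docProps/core.xml as a dict of strings.

    Hypothetical helper: assumes the core properties part lives at the
    conventional path 'docProps/core.xml' inside the archive.
    """
    with zipfile.ZipFile(docx_path) as archive:
        # Raises KeyError if the archive has no such member.
        xml_bytes = archive.read("docProps/core.xml")

    root = ET.fromstring(xml_bytes)
    properties = {}
    for element in root:
        # Tags look like '{namespace-uri}creator'; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        properties[local_name] = element.text
    return properties
```

Calling read_core_properties('your.docx') returns a dictionary comparable to the one printed above, without the conversion of dates to datetime objects that python-docx performs.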

Upvotes: 3

Hit2Mo

Reputation: 160

You could use python-docx. Its Document object has a core_properties attribute (a property, not a method) that exposes 15 metadata attributes such as author, category, etc.
The code below extracts some of them into a Python dictionary:

import docx

def getMetaData(doc):
    metadata = {}
    prop = doc.core_properties
    metadata["author"] = prop.author
    metadata["category"] = prop.category
    metadata["comments"] = prop.comments
    metadata["content_status"] = prop.content_status
    metadata["created"] = prop.created
    metadata["identifier"] = prop.identifier
    metadata["keywords"] = prop.keywords
    metadata["last_modified_by"] = prop.last_modified_by
    metadata["language"] = prop.language
    metadata["modified"] = prop.modified
    metadata["subject"] = prop.subject
    metadata["title"] = prop.title
    metadata["version"] = prop.version
    return metadata

# file_path is the path to your .docx file
doc = docx.Document(file_path)
metadata_dict = getMetaData(doc)
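
Note that the fields named in the question (FileSize, FileModifyDate, FileAccessDate) appear to be file system attributes rather than properties stored inside the document, so core_properties will not contain them. A minimal sketch for reading them with the standard library follows; the function name get_file_stats and the dictionary keys are made up for this example, and the timestamps are converted to UTC as an arbitrary choice.

```python
import os
from datetime import datetime, timezone


def get_file_stats(path):
    """Return the size and timestamps the operating system tracks for a file.

    Hypothetical helper; the key names are illustrative only.
    """
    stats = os.stat(path)
    return {
        "size_bytes": stats.st_size,
        # Last content modification time
        "modified": datetime.fromtimestamp(stats.st_mtime, tz=timezone.utc),
        # Last access time (some file systems update this lazily or not at all)
        "accessed": datetime.fromtimestamp(stats.st_atime, tz=timezone.utc),
    }
```

The result can be merged into the dictionary from getMetaData, for example with metadata_dict.update(get_file_stats(file_path)).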

Upvotes: 5
