Primoz

Reputation: 1492

Add file creation and modification date to metadata with DirectoryLoader

I am working with the LangChain library and would like to know whether DirectoryLoader can load each file's creation and/or modification date along with its content and add that information to the documents' metadata. If so, how?

Currently, I load only docx files, but I would also like to load other documents in the future. My current code snippet is:

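# Import path depends on the installed LangChain version
from langchain.document_loaders import DirectoryLoader
loader = DirectoryLoader(dir, glob="**/*.docx", show_progress=True, silent_errors=True)
docs = loader.load()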

Upvotes: 0

Views: 758

Answers (1)

Primoz

Reputation: 1492

After some research, I found the following solution, although it is not optimal. I subclassed DirectoryLoader and overrode its load_file method to add date metadata to newly loaded documents.

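import os
from datetime import datetime
from pathlib import Path
from typing import Any, List, Optional

# Import paths depend on the installed LangChain version
from langchain.document_loaders import DirectoryLoader
from langchain.schema import Document

class DateDirectoryLoader(DirectoryLoader):
    def load_file(
        self, item: Path, path: Path, docs: List[Document], pbar: Optional[Any]
    ) -> None:
        prev_len = len(docs)
        super().load_file(item, path, docs, pbar)
        if len(docs) > prev_len:
            # docs grew, so super().load_file loaded the file without error
            stat = os.stat(str(item))
            creation_date = datetime.fromtimestamp(stat.st_ctime).isoformat()
            modification_date = datetime.fromtimestamp(stat.st_mtime).isoformat()
            # Tag only the documents added by this call
            for doc in docs[prev_len:]:
                doc.metadata['creation_date'] = creation_date
                doc.metadata['modification_date'] = modification_date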

Important notice: stat.st_ctime is the creation time only on Windows; on Unix it is the time of the last metadata change. For a solution that works across operating systems, see: How do I get file creation and modification date/times?
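
To illustrate the st_ctime caveat, here is a minimal sketch of a cross-platform helper that uses only the standard library. The function name file_dates is hypothetical and not part of LangChain. It assumes st_birthtime is present where the platform exposes a birth time (macOS, BSD, and Windows on Python 3.12+), and it returns None for the creation date where os.stat reports no birth time, as is typically the case on Linux.

```python
import os
import platform
from datetime import datetime
from typing import Dict, Optional


def file_dates(path: str) -> Dict[str, Optional[str]]:
    """Return ISO-formatted creation and modification dates for a file.

    Hypothetical helper: creation_date is None when the platform does
    not expose a birth time through os.stat.
    """
    stat = os.stat(path)
    created = getattr(stat, "st_birthtime", None)
    if created is None and platform.system() == "Windows":
        # Older Python on Windows: st_ctime holds the creation time
        created = stat.st_ctime
    # Elsewhere st_ctime is the metadata change time, so it is not used
    return {
        "creation_date": (
            datetime.fromtimestamp(created).isoformat()
            if created is not None
            else None
        ),
        "modification_date": datetime.fromtimestamp(stat.st_mtime).isoformat(),
    }
```

In the loader above, the two dates could then be taken from file_dates(str(item)) instead of reading stat.st_ctime directly. Whether to leave a missing creation date as None or to substitute the modification date is a design choice that depends on how the metadata is used downstream.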

Upvotes: 0
