Reputation: 1492
I am working with the LangChain library. Is it possible to have DirectoryLoader load each file's creation and/or modification date along with its content, and add those dates to the documents' metadata? If so, how?
Currently I load only docx files, but I would also like to load other document types in the future. My current code snippet is:
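from langchain.document_loaders import DirectoryLoader  # import path may differ by LangChain version
loader = DirectoryLoader(dir, glob="**/*.docx", show_progress=True, silent_errors=True)
docs = loader.load()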
Upvotes: 0
Views: 758
Reputation: 1492
After some research, I found the following solution, though it is not optimal. I subclassed DirectoryLoader as DateDirectoryLoader and overrode its load_file method to add date metadata to newly loaded documents.
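import os
from datetime import datetime
from pathlib import Path
from typing import Any, List, Optional

# Import paths may differ depending on the LangChain version.
from langchain.document_loaders import DirectoryLoader
from langchain.docstore.document import Document

class DateDirectoryLoader(DirectoryLoader):
    def load_file(
        self, item: Path, path: Path, docs: List[Document], pbar: Optional[Any]
    ) -> None:
        prev_len = len(docs)
        super().load_file(item, path, docs, pbar)
        if len(docs) > prev_len:
            # New documents were appended, so super().load_file raised no error.
            stat = os.stat(str(item))
            creation_date = datetime.fromtimestamp(stat.st_ctime).isoformat()
            modification_date = datetime.fromtimestamp(stat.st_mtime).isoformat()
            # Tag only the documents added for this file.
            for doc in docs[prev_len:]:
                doc.metadata['creation_date'] = creation_date
                doc.metadata['modification_date'] = modification_date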
Important notice: stat.st_ctime is the creation date only on Windows; on Unix it is the time of the last metadata change. For a solution that works across operating systems, see: How do I get file creation and modification date/times?
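As a rough illustration of the cross-platform approach mentioned above, here is a minimal sketch. The helper name file_dates is hypothetical (it is not part of LangChain), and it assumes that falling back to the modification time is acceptable on systems such as Linux, where os.stat does not expose a birth time.

```python
import os
import platform
from datetime import datetime
from typing import Dict, Union


def file_dates(path: Union[str, os.PathLike]) -> Dict[str, str]:
    """Return ISO-formatted creation and modification dates for `path`.

    Hypothetical helper, not part of LangChain.
    """
    stat = os.stat(path)
    if hasattr(stat, "st_birthtime"):
        # macOS and BSD (and Windows on Python 3.12+) expose a real birth time.
        created = stat.st_birthtime
    elif platform.system() == "Windows":
        # On Windows, st_ctime holds the creation time.
        created = stat.st_ctime
    else:
        # Linux: os.stat gives no birth time, so fall back to the
        # last modification time as the closest available value.
        created = stat.st_mtime
    return {
        "creation_date": datetime.fromtimestamp(created).isoformat(),
        "modification_date": datetime.fromtimestamp(stat.st_mtime).isoformat(),
    }
```

Inside the overridden load_file, the os.stat and datetime lines could then be replaced by dates = file_dates(item), followed by doc.metadata.update(dates) for each newly added document.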
Upvotes: 0