Reputation: 11
Trying to create embeddings from .md files but DirectoryLoader is stuck. This works for pdf files but not for .md.
I am using the code below to create a vector DB in Chroma. It works perfectly with the commented-out loader that reads PDFs, but with the uncommented line that reads .md files it just stops.
This is inside a method of a class.
# loader = DirectoryLoader(self.data_directory+'/', glob="./*.pdf", loader_cls=PyPDFLoader)
loader = DirectoryLoader(self.data_directory+'/', glob="./*.md", loader_cls=UnstructuredMarkdownLoader)
print("loader")
documents = loader.load()
print("loaded? ")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
print("split? ")
vectordb = Chroma.from_documents(documents=texts,
                                 embedding=self.embedding,
                                 persist_directory=self.persist_directory)
print("persisted? ")
retriever = vectordb.as_retriever()
return retriever
The last thing printed is "loader"; "loaded? " is never printed. Is there an obvious problem I am missing?
loader
0 [main] python (11932) c:\python312\python.exe: *** fatal error - Internal error: TP_NUM_C_BUFS too small: 50
315 [main] python (11932) c:\python312\python.exe: *** fatal error - Internal error: TP_NUM_C_BUFS too small: 50
The output above sometimes appears when I try to load the .md file. I have no idea what it is.
Upvotes: 0
Views: 716
Reputation: 2470
To load Markdown documents, use the UnstructuredMarkdownLoader class, as in the example below:
markdown_path = "../README.md"
loader = UnstructuredMarkdownLoader(markdown_path)
Documentation reference: https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/markdown/
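If the single-file loader works, one way to narrow down where the directory load stops is to load the files one at a time and print each name first. This is only a rough sketch: `load_markdown_one_by_one` and `make_loader` are hypothetical names, not part of LangChain, and the helper only assumes that the loader takes a file path and has a `load()` method, as `UnstructuredMarkdownLoader` does in the example above.

```python
from pathlib import Path
from typing import Any, Callable, List


def load_markdown_one_by_one(data_directory: str,
                             make_loader: Callable[[str], Any]) -> List[Any]:
    """Load every top-level .md file separately, printing its name first.

    `make_loader` is any callable that takes a file path and returns an
    object with a `load()` method returning a list of documents.
    """
    documents: List[Any] = []
    for path in sorted(Path(data_directory).glob("*.md")):
        # Printed before loading, so the last name shown is the file
        # on which the load hangs or crashes.
        print(f"loading {path.name} ...", flush=True)
        loader = make_loader(str(path))
        documents.extend(loader.load())
    return documents
```

Inside the method from the question it could be called as `documents = load_markdown_one_by_one(self.data_directory, UnstructuredMarkdownLoader)`. If it stops on the very first file, the problem is more likely in the environment than in a particular document.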
Upvotes: 0
Reputation: 11
Installing the packages below solved it somehow:
pip install python-magic python-magic-bin
Found the solution here: some other question
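To check whether the package is visible to the interpreter that runs the loader, a small standard-library check like the one below may help. It is a sketch, not a diagnosis: `has_python_magic` is a hypothetical helper name, and it only reports whether a Python module named `magic` (the module that python-magic installs) can be found. It does not prove that the native libmagic library loads correctly, and this thread does not confirm why installing it fixed the hang.

```python
import importlib.util


def has_python_magic() -> bool:
    """Return True if a module named `magic` can be found in this environment."""
    return importlib.util.find_spec("magic") is not None


if __name__ == "__main__":
    if has_python_magic():
        print("The 'magic' module is installed for this interpreter.")
    else:
        print("The 'magic' module was not found; "
              "try: pip install python-magic python-magic-bin")
```

Run it with the same interpreter as the failing script (c:\python312\python.exe in the output above), since each Python installation has its own set of packages.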
Upvotes: 0