ranguy

Reputation: 11

langchain DirectoryLoader stuck when reading .md files

I am trying to create embeddings from .md files, but DirectoryLoader gets stuck. The same code works for PDF files but not for .md files.

I am using the code below to create a vector DB in Chroma. It works perfectly with the commented-out loader that reads PDFs, but with the uncommented line that reads .md files it just stops.

This is inside a method of a class:

# loader = DirectoryLoader(self.data_directory + '/', glob="./*.pdf", loader_cls=PyPDFLoader)
loader = DirectoryLoader(self.data_directory + '/', glob="./*.md", loader_cls=UnstructuredMarkdownLoader)
print("loader")
documents = loader.load()
print("loaded? ")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)
print("split? ")
vectordb = Chroma.from_documents(documents=texts,
                                 embedding=self.embedding,
                                 persist_directory=self.persist_directory)
print("persisted? ")

retriever = vectordb.as_retriever()
return retriever

The last thing printed is "loader"; "loaded? " is never printed. Is there an obvious problem I am missing?

loader
      0 [main] python (11932) c:\python312\python.exe: *** fatal error - Internal error: TP_NUM_C_BUFS too small: 50
    315 [main] python (11932) c:\python312\python.exe: *** fatal error - Internal error: TP_NUM_C_BUFS too small: 50

The output above sometimes appears when I try to load the .md file. I have no idea what it means.

Upvotes: 0

Views: 716

Answers (2)

j3ffyang

Reputation: 2470

To load Markdown documents, use the UnstructuredMarkdownLoader class, as in the example below:

markdown_path = "../README.md"
loader = UnstructuredMarkdownLoader(markdown_path)

Documentation reference: https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/markdown/
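
If loading the whole directory hangs, loading the files one at a time can show which file (if any) triggers it. The following is only a rough sketch of that idea: `load_each` and `read_text` are hypothetical helper names, not part of LangChain, and `read_text` is a plain-text stand-in so the snippet runs without any third-party package.

```python
from pathlib import Path
from typing import Callable, Dict, List


def read_text(path: Path) -> List[str]:
    """Stand-in loader (hypothetical): return the file's text as a one-item list."""
    return [path.read_text(encoding="utf-8")]


def load_each(directory: str,
              load_one: Callable[[Path], List] = read_text) -> Dict[str, List]:
    """Load every .md file in `directory` one at a time, printing progress.

    The last "loading ..." line without a matching "loaded ..." line
    points at the file that the loader got stuck on.
    """
    results: Dict[str, List] = {}
    for path in sorted(Path(directory).glob("*.md")):
        print(f"loading {path.name} ...", flush=True)
        results[path.name] = load_one(path)
        print(f"loaded {path.name}", flush=True)
    return results
```

To try it with the real loader, pass something like `lambda p: UnstructuredMarkdownLoader(str(p)).load()` as `load_one`. If every file hangs, the problem is more likely in the environment than in a particular document.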

Upvotes: 0

ranguy

Reputation: 11

Installing the packages below solved it somehow:

     pip install python-magic python-magic-bin

Found the solution here: some other question
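
A quick way to check whether the package is present before calling the loader is to look for the `magic` module, which is the import name that python-magic provides. This is a minimal sketch; `module_available` is a hypothetical helper, and it assumes the hang is related to that missing module, which this answer does not confirm.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


if not module_available("magic"):
    print("The 'magic' module was not found; "
          "try: pip install python-magic python-magic-bin")
```

The check only confirms that the module is installed. It does not import it, so it says nothing about whether the underlying libmagic library loads correctly.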

Upvotes: 0
