Langchain - Word Documents

Question

I am trying to query a set of Word documents using LangChain, but I get the traceback below.

What argument is expected here?

Also, a side question: is there a way to run such a query locally (without internet access or OpenAI)?

Traceback:

Traceback (most recent call last):

  File C:\Program Files\Spyder\pkgs\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File c:\data\langchain\langchaintest.py:44
    index = VectorstoreIndexCreator().from_loaders(loaders)

  File ~\AppData\Roaming\Python\Python38\site-packages\langchain\indexes\vectorstore.py:72 in from_loaders
    docs.extend(loader.load())

  File ~\AppData\Roaming\Python\Python38\site-packages\langchain\document_loaders\text.py:17 in load
    with open(self.file_path, encoding=self.encoding) as f:

OSError: [Errno 22] Invalid argument:

... where "Invalid argument:" is followed by the raw text from the Word document.

Code:

import os
os.environ["OPENAI_API_KEY"] = "xxxxxx"


import docx
from langchain.document_loaders import TextLoader

# Function to get text from a docx file
def get_text_from_docx(file_path):
    doc = docx.Document(file_path)
    full_text = []
    for paragraph in doc.paragraphs:
        full_text.append(paragraph.text)
    
    return '\n'.join(full_text)

# Load multiple Word documents
folder_path = 'C:/Data/langchain'
word_files = [os.path.join(folder_path, file) for file in os.listdir(folder_path) if file.endswith('.docx')]

loaders = []
for word_file in word_files:
    text = get_text_from_docx(word_file)
    loader = TextLoader(text)
    loaders.append(loader)
    
    
from langchain.indexes import VectorstoreIndexCreator

index = VectorstoreIndexCreator().from_loaders(loaders)

query = "What are the main points discussed in the documents?"

responses = index.query(query)
print(responses)

results_with_source = index.query_with_sources(query)
print(results_with_source)

andrew_reece · Accepted Answer

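The issue is that TextLoader expects a file path string, not raw text; it is designed to load text files from disk. Your code passes the extracted document text as the path, so open() treats that text as a filename, fails, and echoes it back in the error message. Here's the TextLoader.__init__() definition: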

class TextLoader(BaseLoader):
    """Load text files."""

    def __init__(self, file_path: str, encoding: Optional[str] = None):
        """Initialize with file path."""
        self.file_path = file_path
        self.encoding = encoding
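
To see why the error message contains the document's contents, here is a minimal sketch using only the standard library. It mirrors the open() call shown in the traceback; the sample text is a made-up stand-in for what get_text_from_docx() returns.

```python
# Hypothetical stand-in for the text extracted from a .docx file.
raw_text = "First paragraph of the report.\nSecond paragraph of the report."

try:
    # Same call shape as in the traceback: the "path" is really document text.
    with open(raw_text, encoding="utf-8") as f:
        f.read()
    open_failed = False
except OSError as exc:
    # On Windows this is typically "[Errno 22] Invalid argument"; other
    # platforms may raise a different OSError subclass such as
    # FileNotFoundError. Either way, the message echoes the text back.
    open_failed = True
    print(type(exc).__name__, exc)
```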

You might find the Docx2txtLoader useful for working with Word docs.
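
As a rough sketch of that suggestion, the loaders could be built straight from the file paths. This assumes the same LangChain version as in the traceback, where loaders are imported from langchain.document_loaders (newer releases may expose them from a different package), and that the docx2txt package is installed. The helper names find_docx_files and build_docx_loaders are made up for illustration.

```python
import os


def find_docx_files(folder_path):
    """Return the full paths of all .docx files directly inside folder_path."""
    return sorted(
        os.path.join(folder_path, name)
        for name in os.listdir(folder_path)
        # Case-insensitive match, so "Report.DOCX" is picked up as well.
        if name.lower().endswith(".docx")
    )


def build_docx_loaders(folder_path):
    """Create one Docx2txtLoader per Word file, passing the path, not the text."""
    # Imported here so the path helper above works without LangChain installed.
    # Assumption: this import location matches the installed LangChain version.
    from langchain.document_loaders import Docx2txtLoader

    return [Docx2txtLoader(path) for path in find_docx_files(folder_path)]
```

With loaders built this way, the original VectorstoreIndexCreator().from_loaders(build_docx_loaders(folder_path)) call should no longer hit the OSError, and the python-docx helper from the question is not needed.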
