Reputation: 143
I am trying to query a set of Word documents using LangChain, but I get the traceback below.
What argument is expected here?
Side question: is there a way to run such a query locally (without internet access or OpenAI)?
Traceback:
Traceback (most recent call last):
  File C:\Program Files\Spyder\pkgs\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)
  File c:\data\langchain\langchaintest.py:44
    index = VectorstoreIndexCreator().from_loaders(loaders)
  File ~\AppData\Roaming\Python\Python38\site-packages\langchain\indexes\vectorstore.py:72 in from_loaders
    docs.extend(loader.load())
  File ~\AppData\Roaming\Python\Python38\site-packages\langchain\document_loaders\text.py:17 in load
    with open(self.file_path, encoding=self.encoding) as f:
OSError: [Errno 22] Invalid argument:
... where "Invalid argument: " is followed by the raw text from the Word document.
Code:
import os
os.environ["OPENAI_API_KEY"] = "xxxxxx"

import docx
from langchain.document_loaders import TextLoader

# Function to get text from a docx file
def get_text_from_docx(file_path):
    doc = docx.Document(file_path)
    full_text = []
    for paragraph in doc.paragraphs:
        full_text.append(paragraph.text)
    return '\n'.join(full_text)

# Load multiple Word documents
folder_path = 'C:/Data/langchain'
word_files = [os.path.join(folder_path, file) for file in os.listdir(folder_path) if file.endswith('.docx')]

loaders = []
for word_file in word_files:
    text = get_text_from_docx(word_file)
    loader = TextLoader(text)
    loaders.append(loader)

from langchain.indexes import VectorstoreIndexCreator
index = VectorstoreIndexCreator().from_loaders(loaders)

query = "What are the main points discussed in the documents?"
responses = index.query(query)
print(responses)

results_with_source = index.query_with_sources(query)
print(results_with_source)
Upvotes: 0
Views: 3971
Reputation: 21274
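The issue is that TextLoader expects a file path string, not raw text; it is designed to load text files. Your code passes the extracted document text, which load() then hands to open() as if it were a path, hence the OSError followed by the document's contents. Here's the TextLoader.__init__() definition: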
class TextLoader(BaseLoader):
    """Load text files."""

    def __init__(self, file_path: str, encoding: Optional[str] = None):
        """Initialize with file path."""
        self.file_path = file_path
        self.encoding = encoding
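To see the failure in isolation, here is a minimal standard-library sketch that does what the traceback shows load() doing: it calls open() on whatever string it was given. The helper name try_open_as_path is made up for this illustration and is not part of LangChain.

```python
def try_open_as_path(value: str):
    """Try to open `value` as a file path; return the text or the OSError raised."""
    try:
        with open(value, encoding="utf-8") as f:
            return f.read()
    except OSError as exc:
        # Raw document text is not a valid or existing path, so open() fails here.
        return exc


# Hypothetical document text, standing in for the output of get_text_from_docx().
raw_text = "First paragraph of the report\nSecond paragraph: what next?"
result = try_open_as_path(raw_text)
print(type(result).__name__, result)
```

On Windows this typically surfaces as the Errno 22 from the question; on other platforms the exact OSError subclass and errno may differ, but the cause is the same.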
You might find the Docx2txtLoader useful for working with Word docs; it takes the path to a .docx file directly, so you don't need to extract the text yourself.
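A rough sketch of how the loader loop from the question could look with Docx2txtLoader, passing file paths instead of extracted text. It assumes langchain and the docx2txt package are installed and that the import path matches the langchain version in the traceback (newer releases may have moved it). The helper names find_docx_files, build_docx_loaders and query_word_docs are invented for this example.

```python
import os


def find_docx_files(folder_path):
    """Return sorted full paths of the .docx files directly inside folder_path."""
    return sorted(
        os.path.join(folder_path, name)
        for name in os.listdir(folder_path)
        if name.lower().endswith(".docx")  # case-insensitive, unlike the question's check
    )


def build_docx_loaders(file_paths):
    """Create one Docx2txtLoader per file path (a path, not raw text)."""
    # Imported lazily so the helper above works without langchain installed.
    # Assumption: this import path is valid for the installed langchain version.
    from langchain.document_loaders import Docx2txtLoader

    return [Docx2txtLoader(path) for path in file_paths]


def query_word_docs(folder_path, question):
    """Index every .docx file in folder_path and run one query against the index."""
    from langchain.indexes import VectorstoreIndexCreator

    loaders = build_docx_loaders(find_docx_files(folder_path))
    # As in the question, the default index creator still needs OPENAI_API_KEY.
    index = VectorstoreIndexCreator().from_loaders(loaders)
    return index.query(question)
```

With the folder from the question, this would be called as query_word_docs('C:/Data/langchain', "What are the main points discussed in the documents?").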
Upvotes: 1