I'm trying to run an LLM locally and feed it the contents of a very large PDF. I decided to do this with RAG, so I want to create a vector store that contains the content of the PDF. However, I have run into a problem while creating it that I cannot solve, because I am still quite new to this area.
The problem is that I use FAISS and don't know how to pass my values to FAISS.from_embeddings. I have already run into several errors as a result.
My code looks like this:
import re

import PyPDF2
from nltk.tokenize import sent_tokenize  # After downloading resources
from sentence_transformers import SentenceTransformer
from langchain_community.vectorstores import FAISS  # Updated import


def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the PDF.
    """
    with open(pdf_path, 'rb') as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page_num in range(len(reader.pages)):
            page = reader.pages[page_num]
            text += page.extract_text()
    return text


if __name__ == "__main__":
    pdf_path = ""  # Replace with your actual path
    text = extract_text_from_pdf(pdf_path)
    print("Text extracted from PDF file successfully.")

    # Preprocess text to remove special characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # Remove non-ASCII characters
    sentences = sent_tokenize(text)
    print(sentences)  # Print the extracted sentences

    # Filter out empty sentences (optional)
    sentences = [sentence for sentence in sentences if sentence.strip()]

    model_name = 'all-MiniLM-L6-v2'
    model = SentenceTransformer(model_name)

    # Ensure model.encode(sentences) returns a list of NumPy arrays
    embeddings = model.encode(sentences)
    vectorstore = FAISS.from_embeddings(embeddings, sentences_list=sentences)  # problem here
    print("Vector store created successfully.")

    # Example search query (replace with your actual question)
    query = "Was sind die wichtigsten Worte?"
    search_results = vectorstore.search(query)
    print("Search results:")
    for result in search_results:
        print(result)
If I run the code as shown, the following error occurs:
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/PythonProject/extract_pdf_text.py", line 53, in <module>
    vectorstore = FAISS.from_embeddings(embeddings, sentences_list=sentences)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: FAISS.from_embeddings() missing 1 required positional argument: 'embedding'
However, if I instead write vectorstore = FAISS.from_embeddings(embedding=embeddings, sentences_list=sentences), I get an error saying that the text_embeddings parameter is missing.
How do I need to fill in the parameters so that this works, or is there a better way to implement this?
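For reference, here is a minimal sketch of how the two parameters named in the errors might be filled. It assumes that text_embeddings is meant to be an iterable of (text, vector) pairs and that embedding is a LangChain Embeddings object used later to embed search queries, rather than the raw NumPy array. The helper names pair_texts_with_vectors and build_vectorstore are hypothetical, and the use of HuggingFaceEmbeddings from langchain_community as a wrapper around the same sentence-transformers model is an assumption, not something taken from the code above.

```python
from typing import List, Sequence, Tuple


def pair_texts_with_vectors(
    texts: Sequence[str], vectors: Sequence[Sequence[float]]
) -> List[Tuple[str, List[float]]]:
    """Zip each sentence with its vector into a (text, vector) pair."""
    if len(texts) != len(vectors):
        raise ValueError("texts and vectors must have the same length")
    # Convert every component to a plain float so NumPy rows work as well.
    return [(text, [float(x) for x in vec]) for text, vec in zip(texts, vectors)]


def build_vectorstore(sentences: Sequence[str], model_name: str = "all-MiniLM-L6-v2"):
    """Hypothetical helper: build a FAISS store from precomputed embeddings."""
    # Imported lazily so the pairing helper works without these packages installed.
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    # Assumption: this wrapper loads the same sentence-transformers model.
    embedder = HuggingFaceEmbeddings(model_name=model_name)
    vectors = embedder.embed_documents(list(sentences))

    return FAISS.from_embeddings(
        text_embeddings=pair_texts_with_vectors(sentences, vectors),
        embedding=embedder,  # used to embed the query at search time
    )
```

With a store built this way, a query would probably go through vectorstore.similarity_search(query, k=3) rather than vectorstore.search(query), since search also expects a search_type argument. If the embeddings do not need to be precomputed, FAISS.from_texts(sentences, embedder) may be the simpler route.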