Pantastix

Reputation: 342

Problem setting up a FAISS vector store in Python with embeddings

I'm trying to run an LLM locally and feed it the contents of a very large PDF. I decided to do this with RAG, so I want to create a vector store that holds the content of the PDF. However, I run into a problem when creating it that I can't solve, because I'm still quite new to this area.

The problem is that I use FAISS and don't know how to pass my values to .from_embeddings. I have already gotten several different errors as a result.

My code looks like this:

import re
import PyPDF2
from nltk.tokenize import sent_tokenize  # After downloading resources
from sentence_transformers import SentenceTransformer
from langchain_community.vectorstores import FAISS  # Updated import

def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the PDF.
    """

    with open(pdf_path, 'rb') as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
        return text


if __name__ == "__main__":
    pdf_path = ""  # Replace with your actual path

    text = extract_text_from_pdf(pdf_path)
    print("Text extracted from PDF file successfully.")

    # Preprocess text to remove special characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # Remove non-ASCII characters

    sentences = sent_tokenize(text)
    print(sentences)  # Print the extracted sentences

    # Filter out empty sentences (optional)
    sentences = [sentence for sentence in sentences if sentence.strip()]

    model_name = 'all-MiniLM-L6-v2'
    model = SentenceTransformer(model_name)

    # model.encode(sentences) returns a 2D NumPy array with one row per sentence
    embeddings = model.encode(sentences)

    vectorstore = FAISS.from_embeddings(embeddings, sentences_list=sentences)  # problem here
    print("Vector store created successfully.")

    # Example search query (replace with your actual question)
    query = "Was sind die wichtigsten Worte?"
    search_results = vectorstore.similarity_search(query)
    print("Search results:")
    for result in search_results:
        print(result)

If I run the code as shown, the following error occurs:

Traceback (most recent call last):
  File "/Users/user/PycharmProjects/PythonProject/extract_pdf_text.py", line 53, in <module>
    vectorstore = FAISS.from_embeddings(embeddings, sentences_list=sentences)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: FAISS.from_embeddings() missing 1 required positional argument: 'embedding'

However, if I instead write vectorstore = FAISS.from_embeddings(embedding=embeddings, sentences_list=sentences), the error says that the text_embeddings argument is missing.

How do I need to fill in the parameters to make this work, or is there a better way to implement this?
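
For reference, here is a minimal sketch of how the two parameters named in the errors might be filled in. It assumes the langchain_community signature FAISS.from_embeddings(text_embeddings, embedding, ...), where text_embeddings is an iterable of (text, vector) pairs and embedding is an Embeddings object used later to embed queries, not an array. The helper names pair_texts_with_vectors and build_vectorstore are hypothetical, and HuggingFaceEmbeddings is swapped in for the raw SentenceTransformer as an assumption; it is not verified against the setup above.

```python
from typing import List, Sequence, Tuple


def pair_texts_with_vectors(
    texts: Sequence[str], vectors: Sequence[Sequence[float]]
) -> List[Tuple[str, List[float]]]:
    """Build (text, vector) pairs in the shape expected for text_embeddings."""
    if len(texts) != len(vectors):
        raise ValueError("need exactly one vector per text")
    # Convert each component to a plain float so NumPy rows become ordinary lists.
    return [(text, [float(x) for x in vec]) for text, vec in zip(texts, vectors)]


def build_vectorstore(
    sentences: List[str],
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
):
    """Hypothetical helper; imports are local so the pairing helper works alone."""
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import FAISS

    # An Embeddings object: the store keeps it to embed search queries later.
    embedder = HuggingFaceEmbeddings(model_name=model_name)
    vectors = embedder.embed_documents(sentences)
    return FAISS.from_embeddings(
        text_embeddings=pair_texts_with_vectors(sentences, vectors),
        embedding=embedder,
    )
```

With a store built this way, a query could be run as build_vectorstore(sentences).similarity_search(query, k=3). If the vectors do not need to be computed separately, FAISS.from_texts(sentences, embedder) is a shorter route that embeds the texts itself.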

Upvotes: 0

Views: 109

Answers (0)
