Reputation: 27

AstraDBVectorStore add_documents() returns exception "'dict' object has no attribute 'page_content'"

def store_embeddings_in_astradb(embeddings,text_chunks, metadata):

    vstore = AstraDBVectorStore(
        collection_name="test",
        embedding=embedding_model,
        token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
        api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
    )
    print("after Vstore")

    # Create documents with page content, embeddings, and metadata
    documents = [
        {
            "page_content": chunk,
            "metadata": metadata
        }
        for chunk in text_chunks
    ]
    for doc in documents:
        print(f"Document structure: {doc}")
    print("after documents")

    # Add documents to AstraDB vector store
    inserted_ids = vstore.add_documents(documents)
    return inserted_ids


# List of PDF files to process
pdf_files = ["WhatYouNeedToKnowAboutWOMENSHEALTH.pdf", "Womens-Health-Book.pdf"]

# Initialize embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Process each PDF file
for pdf_file in pdf_files:
    if not os.path.isfile(pdf_file):
        raise ValueError(f"PDF file '{pdf_file}' not found.")

    print(f"Processing file: {pdf_file}")

    # Extract text from PDF
    text = extract_text_from_pdf(pdf_file)

    # Split text into chunks
    text_chunks = split_text_into_chunks(text)

    # Embed text chunks
    embeddings = embed_text_chunks(text_chunks, embedding_model)

    # Extract metadata
    metadata = extract_metadata(pdf_file)

    # Store embeddings in AstraDB
    try:
        inserted_ids = store_embeddings_in_astradb(embeddings,text_chunks, metadata)
        print(f"Inserted {len(inserted_ids)} embeddings from '{pdf_file}' into AstraDB.")
    except Exception as e:
        print(f"Failed to insert embeddings for '{pdf_file}': {e}")

This is the code I am using to convert text chunks into embeddings and then store them in Astra DB. At insertion time I get the error 'dict' object has no attribute 'page_content'. How can I resolve it?

Upvotes: 0

Answers (2)

Stefano L

Reputation: 111

I agree with all of Erick's remarks in the other answer: the LangChain vector store class takes care of computing the embeddings, and the vector store instance should be created only once, because instantiation has some overhead. You gain substantial performance by sharing a single vector store across calls.

Now to the core of the problem: the code above mixes LangChain abstractions with bare-bones Python structures (dictionaries). Since you are using the LangChain vector store (an AstraDBVectorStore instance), you should pass the add_documents method a list of the corresponding LangChain abstraction for documents, not a list of dictionaries. The method reads each item's page_content as an attribute, which a dictionary does not have, hence the error. Please add the following import and replace the documents = ... statement as follows:

from langchain_core.documents import Document

[...]

# replace the `documents = ...` part with:

    documents = [
        Document(
            page_content=chunk,
            metadata=metadata,
        )
        for chunk in text_chunks
    ]

[...]

This should now work as intended.
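To see why the dictionaries fail, here is a minimal offline sketch. It does not import LangChain: StandInDocument and collect_texts are hypothetical stand-ins introduced only for illustration, on the assumption that add_documents reads page_content from each item by attribute access.

```python
from dataclasses import dataclass, field


@dataclass
class StandInDocument:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def collect_texts(documents):
    # Attribute access, as assumed for add_documents; dicts only support key access.
    return [doc.page_content for doc in documents]


as_dicts = [{"page_content": "chunk 1", "metadata": {"source": "a.pdf"}}]
as_objects = [StandInDocument(**d) for d in as_dicts]

try:
    collect_texts(as_dicts)
    error_message = None
except AttributeError as exc:
    # Same message as the one reported in the question
    error_message = str(exc)

texts = collect_texts(as_objects)
```

The same dictionaries become valid input once they are wrapped in an object that exposes page_content and metadata as attributes, which is what the real Document class provides.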

Side note:

If you don't feel like creating Documents for the sole, transient purpose of passing them to the vector store's add_documents method, keep in mind that you can also call add_texts and pass two parallel lists, one of texts and one of metadata dicts, directly to the vector store:

vstore.add_texts(["text 1", "text 2", ...], metadatas=[{...}, {...}, ...])

(The above also supports a third list argument, ids=..., if you want to impose your own string IDs on the documents. That helps when you re-run the insertion, since it allows you to avoid storing duplicate entries in the vector store.)
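One way to obtain stable IDs for the ids argument is to derive them from the content itself. The sketch below is only an illustration: make_chunk_ids is a hypothetical helper, and hashing the file name together with the chunk text is an assumed scheme, not something the library requires.

```python
import hashlib


def make_chunk_ids(source, text_chunks):
    """Return one stable string ID per chunk, derived from its source and text."""
    ids = []
    for chunk in text_chunks:
        digest = hashlib.sha256(f"{source}\n{chunk}".encode("utf-8")).hexdigest()
        ids.append(digest[:32])  # shortened for readability
    return ids


text_chunks = ["first chunk of text", "second chunk of text"]
metadata = {"source": "Womens-Health-Book.pdf"}

ids = make_chunk_ids(metadata["source"], text_chunks)
# One metadata dict per text, so the two lists stay parallel
metadatas = [dict(metadata) for _ in text_chunks]

# With a real vector store this would be passed as:
# vstore.add_texts(text_chunks, metadatas=metadatas, ids=ids)
```

Because the IDs depend only on the file name and the chunk text, running the script twice on the same PDF yields the same IDs. Note that two identical chunks from the same file would also share one ID under this scheme.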

Upvotes: 1

Erick Ramirez

Reputation: 16373

I'm struggling to understand your code but I suspect the issue is that the variable scope is incorrect. If you include a minimal code sample plus steps to replicate the problem, I'd be happy to help you troubleshoot it.

As a side note, I would suggest not creating the AstraDBVectorStore object in a function because it is not necessary. You should only instantiate it once and share it for the life of your application.

Also, when you make a call to AstraDBVectorStore.add_documents(), it automatically generates an embedding for each document and then stores them in Astra DB, so it's not necessary to call embed_text_chunks() separately. In fact, I can't see the embeddings variable being used anywhere. Cheers!
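A minimal sketch of the "instantiate once and share" advice, using functools.lru_cache so the constructor runs only on the first call. StandInStore, get_vector_store and store_chunks are hypothetical names introduced here so the example runs offline; in the real script get_vector_store would build the AstraDBVectorStore with the arguments shown in the question.

```python
from functools import lru_cache


class StandInStore:
    """Hypothetical stand-in for AstraDBVectorStore; counts how often it is built."""
    instances = 0

    def __init__(self):
        StandInStore.instances += 1
        self.texts = []

    def add_texts(self, texts, metadatas=None, ids=None):
        self.texts.extend(texts)
        return list(ids) if ids is not None else [str(i) for i in range(len(texts))]


@lru_cache(maxsize=1)
def get_vector_store():
    # Assumption: the real script would construct AstraDBVectorStore(...) here.
    return StandInStore()


def store_chunks(text_chunks, metadata):
    vstore = get_vector_store()  # same instance on every call
    metadatas = [dict(metadata) for _ in text_chunks]
    # No separate embedding step: the store is expected to embed the texts itself.
    return vstore.add_texts(list(text_chunks), metadatas=metadatas)


first_ids = store_chunks(["chunk a", "chunk b"], {"source": "one.pdf"})
second_ids = store_chunks(["chunk c"], {"source": "two.pdf"})
```

Passing the store into the function as a parameter would work just as well; the cached factory simply keeps the call sites unchanged.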

Upvotes: 0
