Reputation: 27
def store_embeddings_in_astradb(embeddings, text_chunks, metadata):
    vstore = AstraDBVectorStore(
        collection_name="test",
        embedding=embedding_model,
        token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
        api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
    )
    print("after Vstore")

    # Create documents with page content, embeddings, and metadata
    documents = [
        {
            "page_content": chunk,
            "metadata": metadata
        }
        for chunk in text_chunks
    ]
    for doc in documents:
        print(f"Document structure: {doc}")
    print("after documents")

    # Add documents to AstraDB vector store
    inserted_ids = vstore.add_documents(documents)
    return inserted_ids
# List of PDF files to process
pdf_files = ["WhatYouNeedToKnowAboutWOMENSHEALTH.pdf", "Womens-Health-Book.pdf"]

# Initialize embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Process each PDF file
for pdf_file in pdf_files:
    if not os.path.isfile(pdf_file):
        raise ValueError(f"PDF file '{pdf_file}' not found.")
    print(f"Processing file: {pdf_file}")

    # Extract text from PDF
    text = extract_text_from_pdf(pdf_file)

    # Split text into chunks
    text_chunks = split_text_into_chunks(text)

    # Embed text chunks
    embeddings = embed_text_chunks(text_chunks, embedding_model)

    # Extract metadata
    metadata = extract_metadata(pdf_file)

    # Store embeddings in AstraDB
    try:
        inserted_ids = store_embeddings_in_astradb(embeddings, text_chunks, metadata)
        print(f"Inserted {len(inserted_ids)} embeddings from '{pdf_file}' into AstraDB.")
    except Exception as e:
        print(f"Failed to insert embeddings for '{pdf_file}': {e}")
This is the code I am using to convert text chunks into embeddings and then store them in AstraDB. At the time of insertion I am getting the error 'dict' object has no attribute 'page_content'. How do I resolve it?
Upvotes: 0
Views: 229
Reputation: 111
I agree with all the remarks by Erick above (the LangChain vector store class will take care of computing the embeddings, and it is important that the vector store instance is created only once, since instantiation has some overhead: you gain substantial performance by sharing a single vector store across calls).
Now to the core of the problem: the code above is mixing LangChain abstractions and bare-bones Python structures (dictionaries). Since you are using the LC vector store (an AstraDBVectorStore instance), you should pass a list of the corresponding LC Document abstraction, instead of dictionaries, to the add_documents method. Please add the following import and replace the documents = ... statement as follows:
from langchain_core.documents import Document

[...]

# replace the `documents = ...` part with:
documents = [
    Document(
        page_content=chunk,
        metadata=metadata,
    )
    for chunk in text_chunks
]

[...]
This should now work as intended.
If you don't feel like creating Documents for the sole transient purpose of passing them to the vector store's add_documents method, keep in mind that you also have the option to call add_texts and pass two parallel lists of texts and metadata dicts directly to the vector store:
vstore.add_texts(["text 1", "text 2", ...], metadatas=[{...}, {...}, ...])
(The above also supports a third list argument, ids=..., if you want to impose your own string IDs on the documents: that helps in case you re-run the insertion, since it allows you to avoid storing duplicate entries in the vector store.)
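For instance, deterministic IDs derived from the file name and chunk index make re-ingestion idempotent. This is just a sketch reusing pdf_file, text_chunks, metadata and vstore from the question; the ID scheme itself is only an illustration:

# Deterministic IDs: re-running the script targets the same entries
# instead of inserting duplicates.
ids = [f"{pdf_file}-{i}" for i in range(len(text_chunks))]
metadatas = [metadata] * len(text_chunks)

inserted_ids = vstore.add_texts(text_chunks, metadatas=metadatas, ids=ids)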
Upvotes: 1
Reputation: 16373
I'm struggling to understand your code, but I suspect the issue is that the variable scope is incorrect. If you include a minimal code sample plus steps to replicate the problem, I'd be happy to help you troubleshoot it.
As a side note, I would suggest not creating the AstraDBVectorStore object inside a function, because it is not necessary. You should only instantiate it once and share it for the life of your application.
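A minimal sketch of that pattern, assuming the langchain_astradb package, the embedding_model already defined in your script, and the same environment variables as in the question (the helper name store_chunks_in_astradb is just illustrative):

import os
from langchain_astradb import AstraDBVectorStore
from langchain_core.documents import Document

# Create the vector store once, at module level, and reuse it everywhere.
vstore = AstraDBVectorStore(
    collection_name="test",
    embedding=embedding_model,
    token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
    api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
)

def store_chunks_in_astradb(vstore, text_chunks, metadata):
    # Only build the Documents and insert them; no new vector store
    # is created on each call.
    documents = [Document(page_content=chunk, metadata=metadata)
                 for chunk in text_chunks]
    return vstore.add_documents(documents)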
Also, when you make a call to AstraDBVectorStore.add_documents(), it will automatically generate the embeddings for each document and then store them in Astra DB, so it's not necessary to make separate calls to embed_text_chunks(). In fact, I can't see the embeddings variable being used anywhere. Cheers!
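Concretely, the per-file loop could then be trimmed to something along these lines (a sketch, assuming your existing extract_text_from_pdf, split_text_into_chunks and extract_metadata helpers and a single shared vstore created with the embedding model):

for pdf_file in pdf_files:
    text = extract_text_from_pdf(pdf_file)
    text_chunks = split_text_into_chunks(text)
    metadata = extract_metadata(pdf_file)

    # add_documents() computes the embeddings itself, using the
    # `embedding` model the vector store was constructed with.
    documents = [Document(page_content=chunk, metadata=metadata)
                 for chunk in text_chunks]
    inserted_ids = vstore.add_documents(documents)
    print(f"Inserted {len(inserted_ids)} chunks from '{pdf_file}'.")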
Upvotes: 0