I am writing a function that indexes some resources using LlamaIndex, with Milvus as the vector database.
When storing the data, I also include metadata for each ingested resource. What is the correct way to avoid re-indexing all the documents every time I call my function? Only the documents missing from the index should be added. My idea was to use an id that I keep in the metadata.
This is how I ingest and persist my data without checking if a document is already indexed:
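from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Load the files; get_content_paths_list and get_metadata_paths_list are my own helpers
documents = SimpleDirectoryReader(
    input_files=get_content_paths_list(),
    file_metadata=get_metadata_paths_list(),
).load_data()

# Embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="dunzhang/stella_en_1.5B_v5")
# LLM served by Ollama
Settings.llm = Ollama(model="llama3.2", request_timeout=360.0)

# Milvus-backed storage; get_or_create_collection is my own helper
storage_context = StorageContext.from_defaults(
    vector_store=get_or_create_collection(dim=1024, collection_name="my_collection")
)
# Embeds and inserts every document on each call, even ones that are already indexed
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)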
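To make the idea concrete, the filtering step I have in mind could look roughly like the sketch below. It is only an illustration: plain dictionaries stand in for the loaded documents, and `select_missing`, the `"id"` metadata key, and the `indexed_ids` set are hypothetical names of mine. How to obtain the ids that are already stored in Milvus is not shown, and is part of what I am asking.

```python
from __future__ import annotations

from typing import Any, Iterable


def select_missing(
    documents: Iterable[dict[str, Any]],
    indexed_ids: set[str],
    id_key: str = "id",  # assumed name of the metadata field holding my id
) -> list[dict[str, Any]]:
    """Return the documents whose metadata id is not in indexed_ids.

    Duplicates within the same batch are also dropped, keeping the first one.
    """
    seen = set(indexed_ids)  # copy so the caller's set is not modified
    missing = []
    for doc in documents:
        doc_id = doc["metadata"][id_key]
        if doc_id in seen:
            continue
        seen.add(doc_id)
        missing.append(doc)
    return missing
```

Only the list returned by this function would then be passed on for embedding and insertion, instead of the full `documents` list.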