Reputation: 1
I'm working with the llama_index Python package and the llama-index-vector-stores-timescalevector extension to manage my vectors in Timescale. However, I can't find a way to store the index for later use, so I have to recreate it every time I run my code, which is inefficient for my use case.
I followed this tutorial: TimescaleVector Example, but it doesn't mention how to store and later load the index.
Here's my setup. The CSV is available at this link.
pip install llama_index llama-index-vector-stores-postgres llama-index-embeddings-openai llama-index-vector-stores-timescalevector
import llama_index
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.vector_stores import VectorStoreQuery, MetadataFilters
from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo
from llama_index.vector_stores.timescalevector import TimescaleVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
import pandas as pd
import os
import time
from datetime import datetime, timedelta
# API keys and paths hidden for security
os.environ["OPENAI_API_KEY"] = 'your_openai_api_key'
os.environ["TIMESCALE_SERVICE_URL"] = 'your_timescale_service_url'
# Load and process data
reuters = pd.read_csv('your_file_path')
reuters.columns = ["title", "date", "description"]
# Function to take in a date string in the past and return a uuid v1
# (uuid_from_time is provided by the timescale_vector client module)
from timescale_vector import client as timescale_client

def create_uuid2(date_string: str):
    if date_string is None:
        return None
    time_format = '%b %d %Y'
    datetime_obj = datetime.strptime(date_string, time_format)
    uuid = timescale_client.uuid_from_time(datetime_obj)
    return str(uuid)
def create_date2(input_string: str) -> str:
    if input_string is None:
        return None
    # Convert the string to a datetime object using strptime
    date_object = datetime.strptime(input_string, '%b %d %Y')
    # Define the time as midnight and the desired timezone offset
    time = "00:00:00"
    timezone_hours = 8
    timezone_minutes = 50
    # Create the formatted timestamp-with-timezone string
    timestamp_tz_str = f"{date_object.year}-{date_object.month:02}-{date_object.day:02} {time}+{timezone_hours:02}{timezone_minutes:02}"
    return timestamp_tz_str
# Create a Node object from a single row of data
def create_node2(row):
    record = row.to_dict()
    record_content = (
        record["date"]
        + " "
        + record["title"]
        + " "
        + record["description"]
    )
    # Can change to TextNode as needed
    node = TextNode(
        id_=create_uuid2(str(record["date"])),
        text=record_content,
        metadata={
            "title": record["title"],
            "date": create_date2(str(record["date"])),
        },
    )
    return node
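# Create nodes and embeddings
nodes = [create_node2(row) for _, row in reuters.iterrows()]
embedding_model = OpenAIEmbedding()
for node in nodes[:100]:  # only the nodes added below need embeddings
    node.embedding = embedding_model.get_text_embedding(node.get_content(metadata_mode="all"))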
# Add nodes to Timescale Vector Store
ts_vector_store = TimescaleVectorStore.from_params(
    service_url=os.environ["TIMESCALE_SERVICE_URL"],
    table_name="reuters_test"
)
_ = ts_vector_store.add(nodes[:100])
# Tried this function. It runs, but I don't know where the index is saved
ts_vector_store.create_index("aaa")
# Also tried to persist the index to disk (not working as expected)
storage_context = StorageContext.from_defaults(persist_dir="your_persist_dir")
index.storage_context.persist(persist_dir="your_persist_dir")  # not clear how to obtain the index variable
from llama_index.core import load_index_from_storage
# load a single index
# need to specify index_id if multiple indexes are persisted to the same directory
index = load_index_from_storage(storage_context)
This is the error I get when calling load_index_from_storage:
KeyError                                  Traceback (most recent call last)
<ipython-input-15-77df363ff364> in <cell line: 6>()
      4     load_graph_from_storage,
      5 )
----> 6 index = load_index_from_storage(storage_context)
4 frames
/usr/local/lib/python3.10/dist-packages/llama_index/core/storage/storage_context.py in vector_store(self)
    262     def vector_store(self) -> BasePydanticVectorStore:
    263         """Backwrds compatibility for vector_store property."""
--> 264         return self.vector_stores[DEFAULT_VECTOR_STORE]
    265
    266     def add_vector_store(
KeyError: 'default'
Does anyone have experience with the llama-index-vector-stores-timescalevector package? How can I properly store and reload the index to avoid having to recreate it each time? Any guidance on the correct method or any relevant documentation would be greatly appreciated.
I expected to be able to store the index and later reload it without needing to recreate it from scratch.
Upvotes: 0
Views: 243
Reputation: 1412
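You can use this link for querying an existing index.
In summary, VectorStoreIndex.from_vector_store is the missing piece. The vectors already live in the Timescale table, so nothing needs to be persisted to a local directory; load_index_from_storage looks for a locally persisted default vector store, which is likely why it fails with KeyError: 'default'. Reconnect to the same table and wrap the store in an index: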
ts_vector_store = TimescaleVectorStore.from_params(
    service_url=os.environ["TIMESCALE_SERVICE_URL"],
    table_name="reuters_test"
)
index = VectorStoreIndex.from_vector_store(vector_store=ts_vector_store)
query_engine = index.as_query_engine()
response = query_engine.query("My question here")
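As a minimal sketch of how the reconnect step might look in a separate script or a fresh session, the helper below wraps the two calls from the snippet above. The function name load_existing_index and its parameters are hypothetical, not part of llama_index. It assumes the table was already populated by an earlier run and that TIMESCALE_SERVICE_URL is set when no URL is passed in. The imports sit inside the function only so the file can be imported without the packages installed.

```python
import os
from typing import Optional


def load_existing_index(table_name: str, service_url: Optional[str] = None):
    """Reconnect to an already-populated Timescale table and wrap it in an index.

    Hypothetical helper: no nodes are added and nothing is re-embedded here,
    because the vectors stay in the database table.
    """
    # Lazy imports, so importing this module does not require the packages.
    from llama_index.core import VectorStoreIndex
    from llama_index.vector_stores.timescalevector import TimescaleVectorStore

    # Fall back to the environment variable used in the question.
    url = service_url or os.environ["TIMESCALE_SERVICE_URL"]

    # Point the store at the same table that the ingest run wrote to.
    store = TimescaleVectorStore.from_params(service_url=url, table_name=table_name)

    # Build the index object on top of the existing store.
    return VectorStoreIndex.from_vector_store(vector_store=store)
```

In a new session, load_existing_index("reuters_test").as_query_engine() should then give a query engine without re-running the ingest code. The question text passed to query() is still embedded at query time, so the embedding model configured then should match the one used at ingest (OpenAIEmbedding in the question, which needs OPENAI_API_KEY).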
Upvotes: 0