Gianluca Baglini

Reputation: 1

Issue with Storing and Loading a Timescale Vector Index in LlamaIndex

I'm currently working with the llama_index Python package and using the llama-index-vector-stores-timescalevector extension to manage my vectors with Timescale. However, I’ve encountered an issue where I’m unable to store the index for future use, which means I have to recreate it every time I run my code. This is quite inefficient and not ideal for my use case.

I followed this tutorial: TimescaleVector Example, but it doesn't mention how to store and later load the index.

Here’s a snippet of my code setup. The CSV is available at this link.

pip install llama_index llama-index-vector-stores-postgres llama-index-embeddings-openai llama-index-vector-stores-timescalevector

import llama_index
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.vector_stores import VectorStoreQuery, MetadataFilters
from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo
from llama_index.vector_stores.timescalevector import TimescaleVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
import pandas as pd
import os
import time
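from datetime import datetime, timedelta
# Needed by create_uuid2 below; ships with the timescale-vector dependency
from timescale_vector import client as timescale_client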

# API keys and paths hidden for security
os.environ["OPENAI_API_KEY"] = 'your_openai_api_key'
os.environ["TIMESCALE_SERVICE_URL"] = 'your_timescale_service_url'

# Load and process data
reuters = pd.read_csv('your_file_path')
reuters.columns = ["title", "date", "description"]

# Function to take in a date string in the past and return a uuid v1
def create_uuid2(date_string: str):
    if date_string is None:
        return None
    time_format = '%b %d %Y'
    datetime_obj = datetime.strptime(date_string, time_format)
    uuid = timescale_client.uuid_from_time(datetime_obj)
    return str(uuid)

def create_date2(input_string: str) -> str | None:
    if input_string is None:
        return None
    # Convert the string to a datetime object using strptime
    date_object = datetime.strptime(input_string, '%b %d %Y')

    # Define the time as midnight and the desired timezone offset
    time = "00:00:00"
    timezone_hours = 8
    timezone_minutes = 50

    # Create the formatted string
    timestamp_tz_str = f"{date_object.year}-{date_object.month:02}-{date_object.day:02} {time}+{timezone_hours:02}{timezone_minutes:02}"
    return timestamp_tz_str



# Create a Node object from a single row of data
def create_node2(row):
    record = row.to_dict()
    record_content = (
        record["date"]
        + " "
        + record["title"]
        + " "
        + record["description"]
    )
    # Build a TextNode whose id is a time-based UUID derived from the article date
    node = TextNode(
        id_=create_uuid2(str(record["date"])),
        text=record_content,
        metadata={
            "title": record["title"],
            "date": create_date2(str(record["date"])),
        },
    )

    return node


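# Create nodes and compute an embedding for each one that will be stored
nodes = [create_node2(row) for _, row in reuters.iterrows()]
embedding_model = OpenAIEmbedding()
for node in nodes[:100]:
    node.embedding = embedding_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )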

# Add nodes to Timescale Vector Store
ts_vector_store = TimescaleVectorStore.from_params(
    service_url=os.environ["TIMESCALE_SERVICE_URL"],
    table_name="reuters_test"
)
_ = ts_vector_store.add(nodes[:100])

# Tried with this function. It runs but I don't know where the index is saved
ts_vector_store.create_index("aaa")
# Also, attempt to store the index (currently not working as expected)
storage_context = StorageContext.from_defaults(persist_dir="your_persist_dir")
index.storage_context.persist(persist_dir="your_persist_dir")  # not clear how to retrieve the index variable

from llama_index.core import load_index_from_storage

# load a single index
# need to specify index_id if multiple indexes are persisted to the same directory
index = load_index_from_storage(storage_context)

This is the error I get when calling load_index_from_storage:

KeyError                                  Traceback (most recent call last)
<ipython-input-15-77df363ff364> in <cell line: 6>()
      4     load_graph_from_storage,
      5 )
----> 6 index = load_index_from_storage(storage_context)

4 frames
/usr/local/lib/python3.10/dist-packages/llama_index/core/storage/storage_context.py in vector_store(self)
    262     def vector_store(self) -> BasePydanticVectorStore:
    263         """Backwrds compatibility for vector_store property."""
--> 264         return self.vector_stores[DEFAULT_VECTOR_STORE]
    265 
    266     def add_vector_store(

KeyError: 'default'

Does anyone have experience with the llama-index-vector-stores-timescalevector package? How can I properly store and reload the index to avoid having to recreate it each time? Any guidance on the correct method or any relevant documentation would be greatly appreciated.

I expected to be able to store the index and later reload it without needing to recreate it from scratch.

Upvotes: 0

Views: 243

Answers (1)

jonatasdp

Reputation: 1412

See this link for how to query an existing index.

In summary, VectorStoreIndex is the missing piece.

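import os

from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.timescalevector import TimescaleVectorStore

# Reconnect to the table that already holds the vectors
ts_vector_store = TimescaleVectorStore.from_params(
    service_url=os.environ["TIMESCALE_SERVICE_URL"],
    table_name="reuters_test"
)
# Wrap the existing store in an index; nothing is re-embedded here
index = VectorStoreIndex.from_vector_store(vector_store=ts_vector_store)
query_engine = index.as_query_engine()
response = query_engine.query("My question here")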
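
To spell out why this works: the vectors are written to the Timescale (Postgres) table when you add them, so there is nothing to persist to a local directory. The KeyError: 'default' most likely comes from load_index_from_storage looking for a vector store saved in persist_dir and not finding one, although I have not verified that against your exact setup. Below is a minimal sketch of how "ingest once, reconnect afterwards" could be organised. The REBUILD_INDEX environment variable and the should_ingest, connect_store and get_index helpers are names made up for this example, not part of either library.

```python
import os


def should_ingest(env=os.environ) -> bool:
    """Return True only when a rebuild is explicitly requested.

    REBUILD_INDEX is a hypothetical flag invented for this sketch.
    """
    return env.get("REBUILD_INDEX", "0") == "1"


def connect_store(table_name: str = "reuters_test"):
    """Open a handle to the table that holds (or will hold) the vectors."""
    # Imported lazily so should_ingest can be used without the packages installed.
    from llama_index.vector_stores.timescalevector import TimescaleVectorStore

    return TimescaleVectorStore.from_params(
        service_url=os.environ["TIMESCALE_SERVICE_URL"],
        table_name=table_name,
    )


def get_index(nodes=None):
    """Build the index on the first run, reconnect to it on later runs."""
    from llama_index.core import StorageContext, VectorStoreIndex

    store = connect_store()
    if nodes is not None and should_ingest():
        # First run: embed the nodes and write them to the Timescale table.
        storage_context = StorageContext.from_defaults(vector_store=store)
        return VectorStoreIndex(nodes, storage_context=storage_context)
    # Later runs: reuse the rows already stored in the table.
    return VectorStoreIndex.from_vector_store(vector_store=store)
```

Run it once with REBUILD_INDEX=1 and your nodes to populate the table. On later runs, call get_index() with no arguments and go straight to index.as_query_engine().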

Upvotes: 0
