Santiago Padilla
Santiago Padilla

Reputation: 65

Supabase pgvector with custom data for query

I am trying to make a database with document embeddings using pgvector, but I would like to know if there is a way to have some field in the collection of documents so that I can categorize the documents, example: if I have multiple companies and each one has different information , and I don't want to mix their data, can't I query the documents for some field and thus only obtain those related to that company? Like space-name in pinecone.

create table documents (
  id bigserial primary key,
  content text,
  embedding vector(1536),
company_id text // Custom id for each company
);


  const { data: documents, error } = await supabaseClient.rpc('match_documents', {
    query_embedding: embedding,
    match_threshold: .73,
    match_count: 10,
    company_id: "" // pass id to func
  })

Inside the match_documents function filter to only get embeddings from that company.

How can I do this?

    create or replace function match_documents (
  query_embedding vector(384),
  match_threshold float,
  match_count int
)
returns table (
  id bigint,
  content text,
  similarity float
)
language sql stable
as $$
  select
    documents.id,
    documents.content,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > match_threshold and documents.custom_id = custom_id
  order by similarity desc
  limit match_count;
$$;

Upvotes: 0

Views: 530

Answers (1)

GPM
GPM

Reputation: 1

You're on the right track by adding the company_id field to your table. I would recommend using a different, more appropriate type for the company_id field though (bigint, bigserial, etc.):

create table documents (
  id bigserial primary key,
  content text,
  embedding vector(1536),
  company_id bigserial
);

Then, you should add an index for this field for faster lookups:

CREATE INDEX company_id_index ON documents (company_id);

Finally, modify your function to include company_id as a parameter and use it to filter the query results:

create or replace function match_documents (
  query_embedding vector(384),
  match_threshold float,
  match_count int,
  company_id bigserial
)
returns table (
  id bigint,
  content text,
  similarity float
)
language sql stable
as $$
  select
    documents.id,
    documents.content,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > match_threshold and documents.company_id = company_id
  order by similarity desc
  limit match_count;
$$;

This will provide documents from a specific company based on their embeddings.

Upvotes: 0

Related Questions