Reputation: 65
I am trying to make a database with document embeddings using pgvector, but I would like to know if there is a way to have some field in the collection of documents so that I can categorize the documents, example: if I have multiple companies and each one has different information , and I don't want to mix their data, can't I query the documents for some field and thus only obtain those related to that company? Like space-name in pinecone.
create table documents (
id bigserial primary key,
content text,
embedding vector(1536),
company_id text // Custom id for each company
);
const { data: documents, error } = await supabaseClient.rpc('match_documents', {
query_embedding: embedding,
match_threshold: .73,
match_count: 10,
company_id: "" // pass id to func
})
Inside the match_documents function filter to only get embeddings from that company.
How can I do this?
create or replace function match_documents (
query_embedding vector(384),
match_threshold float,
match_count int
)
returns table (
id bigint,
content text,
similarity float
)
language sql stable
as $$
select
documents.id,
documents.content,
1 - (documents.embedding <=> query_embedding) as similarity
from documents
where 1 - (documents.embedding <=> query_embedding) > match_threshold and documents.custom_id = custom_id
order by similarity desc
limit match_count;
$$;
Upvotes: 0
Views: 530
Reputation: 1
You're on the right track by adding the company_id
field to your table. I would recommend using a different, more appropriate type for the company_id
field though (bigint
, bigserial
, etc.):
create table documents (
id bigserial primary key,
content text,
embedding vector(1536),
company_id bigserial
);
Then, you should add an index for this field for faster lookups:
CREATE INDEX company_id_index ON documents (company_id);
Finally, modify your function to include company_id
as a parameter and use it to filter the query results:
create or replace function match_documents (
query_embedding vector(384),
match_threshold float,
match_count int,
company_id bigserial
)
returns table (
id bigint,
content text,
similarity float
)
language sql stable
as $$
select
documents.id,
documents.content,
1 - (documents.embedding <=> query_embedding) as similarity
from documents
where 1 - (documents.embedding <=> query_embedding) > match_threshold and documents.company_id = company_id
order by similarity desc
limit match_count;
$$;
This will provide documents from a specific company based on their embeddings.
Upvotes: 0