Reputation: 79228
Looking at https://github.com/pgvector/pgvector?tab=readme-ov-file#querying for inspiration, do you happen to know how to do a Kysely query for the embedding? I am new to vector embeddings and everything, but I have created embeddings for dictionary definitions for an English dictionary, and I want to find all the definitions which "closely match" the input query term. Do you happen to know how to make that kind of query?
const definitions = await db
.selectFrom('definitions')
.select(['id', 'language_id', 'entry_id'])
.where('embedding', '<=>', cosineDistance('embedding', embeddingQ))
.orderBy(cosineDistance('embedding', embeddingQ))
.limit(5)
.execute()
I get:
error: operator does not exist: double precision <=> unknown
This version (without the where
clause) works, but then it will always return something, even if it isn't a match or even a close match.
const definitions = await db
.selectFrom('definitions')
.select(['id', 'language_id', 'entry_id'])
// .where('embedding', '<=>', cosineDistance('embedding', embeddingQ))
.orderBy(cosineDistance('embedding', embeddingQ))
.limit(5)
.execute()
Any ideas how to get this to work in Kysely?
Full code:
import 'dotenv/config'
import { cosineDistance } from 'pgvector/kysely'
import { pipeline } from '@xenova/transformers'
import { createClient } from '@supabase/supabase-js'
import pgvector from 'pgvector/kysely'
import { db } from '~/site/utility/kysely'
import { v4 as uuidv4 } from 'uuid'
// https://supabase.com/docs/guides/ai/vector-columns
// https://github.com/pgvector/pgvector-node?tab=readme-ov-file#kysely
const supabaseUrl = String(process.env.SUPABASE_URL!)
const supabaseKey = String(process.env.SUPABASE_KEY!)
const supabase = createClient(supabaseUrl, supabaseKey)
const generateEmbedding = await pipeline(
'feature-extraction',
'Supabase/gte-small',
)
await db.deleteFrom('definitions').execute()
const title = 'Hello post!'.toLowerCase()
const body = 'Hello world!'.toLowerCase()
const query = 'hello'
// Generate a vector using Transformers.js
const a = await generateEmbedding(body, {
pooling: 'mean',
normalize: true,
})
const b = await generateEmbedding(title, {
pooling: 'mean',
normalize: true,
})
const q = await generateEmbedding(query, {
pooling: 'mean',
normalize: true,
})
const embeddingA = Array.from(a.data)
const embeddingB = Array.from(b.data)
const embeddingQ = Array.from(q.data)
const latin = await db
.selectFrom('languages')
.selectAll()
.where('slug', '=', 'latin')
.executeTakeFirstOrThrow()
const newItems = [
{
id: uuidv4(),
embedding: pgvector.toSql(embeddingA),
language_id: latin.id,
entry_id: 1,
},
{
id: uuidv4(),
embedding: pgvector.toSql(embeddingB),
language_id: latin.id,
entry_id: 2,
},
]
const definitions = await db
.selectFrom('definitions')
.select(['id', 'language_id', 'entry_id'])
// .where('embedding', '<=>', cosineDistance('embedding', embeddingQ))
.orderBy(cosineDistance('embedding', embeddingQ))
.limit(5)
.execute()
Supabase docs show this, but not quite sure how to translate that to Kysely, or if I can just use a simple where
query builder in Kysely, instead of defining a SQL function:
create or replace function match_documents (
query_embedding vector(384),
match_threshold float,
match_count int
)
returns table (
id bigint,
title text,
body text,
similarity float
)
language sql stable
as $$
select
documents.id,
documents.title,
documents.body,
1 - (documents.embedding <=> query_embedding) as similarity
from documents
where 1 - (documents.embedding <=> query_embedding) > match_threshold
order by (documents.embedding <=> query_embedding) asc
limit match_count;
$$;
This is what I did that seems to work, but if this is the only (correct) way to do that, my next question is how to automatically set a similarity score threshold:
// where 1 - (embedding <=> ${vectorToSql(embeddingQ)}) > ???something???
const definitions = await db.executeQuery(
sql`
select id, language__id, entry__id, 1 - (embedding <=>
${vectorToSql(embeddingQ)}) as similarity
from definitions
order by (embedding <=> ${vectorToSql(embeddingQ)}) asc
limit 10;`.compile(db),
)
function vectorToSql(value) {
if (Array.isArray(value)) {
return JSON.stringify(value.map(v => Number(v)))
}
return value
}
Upvotes: 1
Views: 285