Lance Pollard
Lance Pollard

Reputation: 79228

How to do Vector Embedding query in Kysely/PostgreSQL with pgvector?

Looking at https://github.com/pgvector/pgvector?tab=readme-ov-file#querying for inspiration, do you happen to know how to do a Kysely query for the embedding? I am new to vector embeddings and everything, but I have created embeddings for dictionary definitions for an English dictionary, and I want to find all the definitions which "closely match" the input query term. Do you happen to know how to make that kind of query?

const definitions = await db
  .selectFrom('definitions')
  .select(['id', 'language_id', 'entry_id'])
  .where('embedding', '<=>', cosineDistance('embedding', embeddingQ))
  .orderBy(cosineDistance('embedding', embeddingQ))
  .limit(5)
  .execute()

I get:

error: operator does not exist: double precision <=> unknown

This version (without the where clause) works, but then it will always return something, even if it isn't a match or even a close match.

const definitions = await db
  .selectFrom('definitions')
  .select(['id', 'language_id', 'entry_id'])
  // .where('embedding', '<=>', cosineDistance('embedding', embeddingQ))
  .orderBy(cosineDistance('embedding', embeddingQ))
  .limit(5)
  .execute()

Any ideas how to get this to work in Kysely?

Full code:

import 'dotenv/config'
import { cosineDistance } from 'pgvector/kysely'
import { pipeline } from '@xenova/transformers'
import { createClient } from '@supabase/supabase-js'
import pgvector from 'pgvector/kysely'
import { db } from '~/site/utility/kysely'
import { v4 as uuidv4 } from 'uuid'

// https://supabase.com/docs/guides/ai/vector-columns
// https://github.com/pgvector/pgvector-node?tab=readme-ov-file#kysely

const supabaseUrl = String(process.env.SUPABASE_URL!)
const supabaseKey = String(process.env.SUPABASE_KEY!)
const supabase = createClient(supabaseUrl, supabaseKey)

const generateEmbedding = await pipeline(
  'feature-extraction',
  'Supabase/gte-small',
)

await db.deleteFrom('definitions').execute()

const title = 'Hello post!'.toLowerCase()
const body = 'Hello world!'.toLowerCase()
const query = 'hello'

// Generate a vector using Transformers.js
const a = await generateEmbedding(body, {
  pooling: 'mean',
  normalize: true,
})
const b = await generateEmbedding(title, {
  pooling: 'mean',
  normalize: true,
})
const q = await generateEmbedding(query, {
  pooling: 'mean',
  normalize: true,
})

const embeddingA = Array.from(a.data)
const embeddingB = Array.from(b.data)
const embeddingQ = Array.from(q.data)

const latin = await db
  .selectFrom('languages')
  .selectAll()
  .where('slug', '=', 'latin')
  .executeTakeFirstOrThrow()

const newItems = [
  {
    id: uuidv4(),
    embedding: pgvector.toSql(embeddingA),
    language_id: latin.id,
    entry_id: 1,
  },
  {
    id: uuidv4(),
    embedding: pgvector.toSql(embeddingB),
    language_id: latin.id,
    entry_id: 2,
  },
]

const definitions = await db
  .selectFrom('definitions')
  .select(['id', 'language_id', 'entry_id'])
  // .where('embedding', '<=>', cosineDistance('embedding', embeddingQ))
  .orderBy(cosineDistance('embedding', embeddingQ))
  .limit(5)
  .execute()

Supabase docs show this, but not quite sure how to translate that to Kysely, or if I can just use a simple where query builder in Kysely, instead of defining a SQL function:

create or replace function match_documents (
  query_embedding vector(384),
  match_threshold float,
  match_count int
)
returns table (
  id bigint,
  title text,
  body text,
  similarity float
)
language sql stable
as $$
  select
    documents.id,
    documents.title,
    documents.body,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > match_threshold
  order by (documents.embedding <=> query_embedding) asc
  limit match_count;
$$;

This is what I did that seems to work, but if this is the only (correct) way to do that, my next question is how to automatically set a similarity score threshold:

// where 1 - (embedding <=> ${vectorToSql(embeddingQ)}) > ???something???
const definitions = await db.executeQuery(
  sql`
  select id, language__id, entry__id, 1 - (embedding <=> 
  ${vectorToSql(embeddingQ)}) as similarity
  from definitions
  order by (embedding <=> ${vectorToSql(embeddingQ)}) asc
  limit 10;`.compile(db),
)

function vectorToSql(value) {
  if (Array.isArray(value)) {
    return JSON.stringify(value.map(v => Number(v)))
  }
  return value
}

Upvotes: 1

Views: 285

Answers (0)

Related Questions