Lance Pollard

How can I use AI or vector-embedding approaches to find multi-word anagrams for arbitrary user input against a SQL database of words?

I think I've figured out a reasonable solution to the single-word anagram problem in SQL (which I've answered here); it uses a schema basically like this:

CREATE TABLE language_string (
    id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    text VARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE language_string_symbol (
    string__id BIGINT NOT NULL,
    symbol CHAR(1) NOT NULL,
    count INT NOT NULL,
    PRIMARY KEY (string__id, symbol)
);
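
For context, the language_string_symbol rows come from a decomposition step roughly like this (a TypeScript sketch, not the exact code from the linked answer):

// Sketch of the decomposition step that populates language_string_symbol:
// one (symbol, count) row per distinct letter in the string.
function symbolCounts(text: string): Map<string, number> {
  const counts = new Map<string, number>()
  for (const char of text.toLowerCase()) {
    if (char < 'a' || char > 'z') continue // letters only, matching CHAR(1)
    counts.set(char, (counts.get(char) ?? 0) + 1)
  }
  return counts
}

// symbolCounts('hello') -> Map { 'h' => 1, 'e' => 1, 'l' => 2, 'o' => 1 }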

Here's an example query, for the input "helloworld":

SELECT s.id
FROM language_string s
JOIN language_string_symbol sym ON s.id = sym.string__id
GROUP BY s.id
-- A string matches when every one of its letters fits inside the
-- input's letter multiset ("helloworld": h1 e1 l3 o2 w1 r1 d1).
HAVING SUM(CASE
    WHEN sym.symbol = 'h' AND sym.count <= 1 THEN 0
    WHEN sym.symbol = 'e' AND sym.count <= 1 THEN 0
    WHEN sym.symbol = 'l' AND sym.count <= 3 THEN 0
    WHEN sym.symbol = 'o' AND sym.count <= 2 THEN 0
    WHEN sym.symbol = 'w' AND sym.count <= 1 THEN 0
    WHEN sym.symbol = 'r' AND sym.count <= 1 THEN 0
    WHEN sym.symbol = 'd' AND sym.count <= 1 THEN 0
    ELSE 1 -- any letter outside the input, or over its limit, disqualifies
  END) = 0;
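
The query itself is generated from the input's letter counts, roughly like this (a sketch reusing symbolCounts from above; string interpolation just keeps it short, real code should use bound parameters):

// Rough sketch: build the sub-multiset anagram query for arbitrary input.
function buildAnagramQuery(input: string): string {
  const branches = [...symbolCounts(input).entries()]
    .map(
      ([symbol, max]) =>
        `WHEN sym.symbol = '${symbol}' AND sym.count <= ${max} THEN 0`,
    )
    .join('\n    ')
  return `SELECT s.id
FROM language_string s
JOIN language_string_symbol sym ON s.id = sym.string__id
GROUP BY s.id
HAVING SUM(CASE
    ${branches}
    ELSE 1
  END) = 0;`
}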

This makes it possible to paginate over the anagrams of a user's input string, including all substring anagrams (strings whose letters fit inside the input's letters), and it seems reasonably efficient even on large datasets.

The problem, though, is that this only matches strings that already exist in the database; it won't find multi-word strings unless they're stored verbatim. I partially address this in the answer linked above, but that approach is very inefficient and produces undesirable results. For example, searching "twelve plus one" might return:

eleven plus two
eleven two plus
plus two eleven
plus eleven two
two plus eleven
two eleven plus

Basically, it produces every permutation of the same words, which would be annoying for a user who's just looking for fun anagrams.
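
To make the blow-up concrete, the naive multi-word expansion looks roughly like this (a sketch, reusing symbolCounts from above): it peels off any dictionary word that fits the remaining letter pool and recurses, so every ordering of the same word multiset comes out as a separate result.

// Returns a copy of the pool with the word's letters removed,
// or null if the word doesn't fit.
function subtract(
  pool: Map<string, number>,
  word: string,
): Map<string, number> | null {
  const next = new Map(pool)
  for (const ch of word) {
    const n = next.get(ch) ?? 0
    if (n === 0) return null
    next.set(ch, n - 1)
  }
  return next
}

// Naive expansion: every permutation of the same words is emitted
// separately, which is why "twelve plus one" yields all six orderings
// of "eleven plus two".
function expand(
  pool: Map<string, number>,
  dictionary: string[],
  path: string[] = [],
  results: string[][] = [],
): string[][] {
  if ([...pool.values()].every((n) => n === 0)) {
    results.push([...path])
    return results
  }
  for (const word of dictionary) {
    const remaining = subtract(pool, word)
    if (remaining) expand(remaining, dictionary, [...path, word], results)
  }
  return results
}

// expand(symbolCounts('twelveplusone'), ['eleven', 'plus', 'two', /* ... */])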

That got me wondering: what if you could find only the anagrams that occur as common strings in a corpus, like Wikipedia or Common Crawl? It would take arbitrary user input like araceraceraceaaa, and instead of producing every combination, like:

a race race race a a a
a race race race aa a
a race race race a aa
a race race race aaa
a racer ace race a a a
a racer ace race aa a
a racer ace race a aa
a racer ace race aaa
...

it would find only what appears in the corpus or database directly (individual words it would always find, but multi-word combinations would have to exist verbatim in the database or the corpus):

a racer ace
a race race
a care race
a race care
...

So basically, sorted longest to shortest, it would return only the expressions actually found on, say, Wikipedia (I'm just making up the example output in the last snippet).
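
Put differently, once candidates exist, I want a membership-plus-dedup pass over real corpus phrases, something with this shape (a sketch; where corpusPhrases comes from, compactly, is exactly what I'm asking about):

// Keep only candidates whose exact phrase occurs in the corpus, and
// collapse permutations by keeping one result per multiset of words.
function corpusAnagrams(
  candidates: string[][],
  corpusPhrases: Set<string>, // the piece I don't know how to store compactly
): string[] {
  const seen = new Set<string>()
  const out: string[] = []
  for (const words of candidates) {
    const phrase = words.join(' ')
    if (!corpusPhrases.has(phrase)) continue // must occur verbatim in the corpus
    const key = [...words].sort().join(' ') // canonical form kills permutations
    if (!seen.has(key)) {
      seen.add(key)
      out.push(phrase)
    }
  }
  return out.sort((a, b) => b.length - a.length) // longest to shortest
}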

Is that possible to do efficiently somehow? If so, what would go into a solution at a high level? What pieces would I need to wire together?

By "efficiently", I mean I don't want to store every 2-word, 3-word, and (let's say up to) 4-word sequence from a terabyte-plus corpus in the database. That would easily solve the problem, but it's far too much data for my needs.

But is there a way to turn the corpus into vector data of some sort, such that you could match the input against it and efficiently get back the anagrams found in the corpus? Is there any trickery that would make that possible?

I haven't precisely defined "efficient", but basically I'm looking to avoid all of these:

  1. Traverse the terabyte+ corpus for 2/3/4 word sequences at runtime.
  2. Store all 2/3/4 word sequences in the database.
  3. Keep the whole corpus all in memory instead of SQL.

I would just like to convert the corpus into a vector embedding of some kind and store that, if that's at all possible. For context, I'm coming from this Transformers.js example of creating vector embeddings for storage in SQL:

import type { FeatureExtractionPipeline } from '@xenova/transformers'

// Create a reusable pipeline
let embeddingPipeline: FeatureExtractionPipeline | undefined

async function getEmbeddingPipeline() {
  if (!embeddingPipeline) {
    const { pipeline } = await import('@xenova/transformers')
    embeddingPipeline = await pipeline(
      'feature-extraction',
      // 'Xenova/all-MiniLM-L6-v2',
      'Xenova/bge-large-en-v1.5',
    )
  }
  return embeddingPipeline
}

// const embedding = await generateEmbedding("Oak tree. Quercus X. Plant.");
export async function generateEmbedding(
  text: string,
): Promise<Array<number>> {
  try {
    const pipe = await getEmbeddingPipeline()
    const output = await pipe(text, { pooling: 'mean' })
    return Array.from(output.data)
  } catch (error) {
    console.error('Error generating embedding:', error)
    throw error
  }
}
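
For completeness, here's roughly how I'd store what that returns, assuming Postgres with the pgvector extension and the postgres.js client (both are assumptions about my setup, not shown above; bge-large-en-v1.5 outputs 1024-dimensional vectors):

import postgres from 'postgres'

// Assumed table:
//   CREATE TABLE corpus_phrase (text TEXT PRIMARY KEY, embedding vector(1024));
const sql = postgres(process.env.DATABASE_URL!)

const text = 'eleven plus two'
const embedding = await generateEmbedding(text)

// pgvector accepts a '[x,y,...]' string literal cast to ::vector
await sql`
  INSERT INTO corpus_phrase (text, embedding)
  VALUES (${text}, ${`[${embedding.join(',')}]`}::vector)
`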

Or, if vector-embedding magic isn't possible, is there some other trick for storing this in a way that supports fast queries against corpus matches? Is anything like that possible?
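
To make "some other trickery" concrete, one direction I can imagine (just a sketch, and I don't know whether it meets the efficiency bar) is a Bloom filter over every 2-4-word phrase in the corpus: it stores a few bits per phrase instead of the phrases themselves, at the cost of a small false-positive rate, and candidate combinations from the SQL side could be checked against it at query time.

// Minimal Bloom filter sketch: `size` bits, `hashes` hash functions.
class BloomFilter {
  private bits: Uint8Array

  constructor(
    private size: number, // number of bits
    private hashes: number, // hash functions per item
  ) {
    this.bits = new Uint8Array(Math.ceil(size / 8))
  }

  // Seeded FNV-1a-style hash; fine for a sketch.
  private hash(value: string, seed: number): number {
    let h = 2166136261 ^ seed
    for (let i = 0; i < value.length; i++) {
      h ^= value.charCodeAt(i)
      h = Math.imul(h, 16777619)
    }
    return (h >>> 0) % this.size
  }

  add(value: string): void {
    for (let k = 0; k < this.hashes; k++) {
      const bit = this.hash(value, k)
      this.bits[bit >> 3] |= 1 << (bit & 7)
    }
  }

  mightContain(value: string): boolean {
    for (let k = 0; k < this.hashes; k++) {
      const bit = this.hash(value, k)
      if (!(this.bits[bit >> 3] & (1 << (bit & 7)))) return false
    }
    return true // possibly a false positive
  }
}

// Offline: one pass over the corpus, adding each 2-4-word window.
// Online: filter.mightContain('eleven plus two'), then verify survivors.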
