mongy910

Reputation: 497

How to efficiently search a large set of documents to find ones in which 90-95% of the words exist in a given word set

I'm building a platform that teaches users languages by always showing them content that is slightly above their level (they understand 90-95% of the words, and 5-10% of the words are new to them).

I have a large database of content (millions of items), and I'm trying to figure out how to efficiently search for documents within this 90-95% range.

Brute force filtering through every document is too slow. How can I do this more efficiently? Would a vector DB make sense for my use case? Are there preprocessing steps that would help here?

Upvotes: 1

Views: 144

Answers (1)

Bohemian

Reputation: 425278

Do a one-time pass over all documents (and when updating/inserting) to calculate the "percentage of words that exist in a given word set" and save it as an attribute of the document.
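A minimal sketch of that precomputation step in Python, assuming a naive whitespace/regex tokenizer (a real pipeline would likely want language-specific tokenization and lemmatization, and `known_word_fraction` is a hypothetical helper name):

```python
import re

def known_word_fraction(text: str, known_words: set[str]) -> float:
    """Return the fraction of tokens in `text` that appear in `known_words`.

    Naive lowercase regex tokenization; swap in proper tokenization
    and lemmatization for a production system.
    """
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in known_words)
    return known / len(tokens)

# Example: 6 of the 7 tokens are in the learner's vocabulary.
vocab = {"the", "cat", "sat", "on", "mat"}
frac = known_word_fraction("The cat sat on the small mat", vocab)
```

You would run this once over the corpus, store `frac` on each document, and recompute it only when the document (or the user's word set) changes.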

To improve performance when finding documents with a high percentage, create an index on the calculated value.

To find them even faster, save the ids of high-percentage documents in a separate dedicated table and retrieve each document by its id. Keeping the extra table up to date when documents change is more work, but reads will be fast.
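The index and the dedicated-table approach can be sketched together with SQLite (an illustrative schema only; the table and column names here are assumptions, not part of the original answer):

```python
import sqlite3

# In-memory sketch; a real deployment would use its own database and schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        body TEXT,
        known_pct REAL          -- precomputed known-word percentage
    );
    -- Index so range queries on the percentage avoid a full scan.
    CREATE INDEX idx_docs_known_pct ON documents(known_pct);

    -- Dedicated table of documents already in the target band,
    -- refreshed whenever a document is inserted or updated.
    CREATE TABLE target_band_docs (doc_id INTEGER PRIMARY KEY);
""")

conn.executemany(
    "INSERT INTO documents (id, body, known_pct) VALUES (?, ?, ?)",
    [(1, "easy text", 0.99), (2, "just right", 0.93), (3, "too hard", 0.60)],
)
# Populate the dedicated table from the indexed column.
conn.execute(
    "INSERT INTO target_band_docs "
    "SELECT id FROM documents WHERE known_pct BETWEEN 0.90 AND 0.95"
)

# Reads become a cheap primary-key join.
rows = conn.execute(
    "SELECT d.id FROM target_band_docs t JOIN documents d ON d.id = t.doc_id"
).fetchall()
```

The trade-off is exactly as described above: writes must maintain `target_band_docs`, but the read path touches only ids.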

If the content never changes, maintaining and using the metadata is far simpler.

Upvotes: 0
