mongy910

Reputation: 497

How to efficiently search a large set of documents to find ones in which 90-95% of the words exist in a given word set

I'm building a platform that teaches users languages by always showing them content that is slightly above their level (they understand 90-95% of the words, and 5-10% of the words are new to them).

I have a large database of content (millions of items), and I'm trying to figure out how to efficiently search for documents within this 90-95% range.

Brute force filtering through every document is too slow. How can I do this more efficiently? Would a vector DB make sense for my use case? Are there preprocessing steps that would help here?

Upvotes: 1

Views: 144

Answers (1)

Bohemian

Reputation: 425278

Do a one-time pass over all documents (and when updating/inserting) to calculate the "percentage of words that exist in a given word set" and save it as an attribute of the document.
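A minimal sketch of that precomputation step in Python, assuming a naive whitespace/regex tokenizer (a real pipeline would likely want language-specific tokenization and lemmatization, and `known_word_fraction` is a hypothetical helper name):

```python
import re

def known_word_fraction(text: str, known_words: set[str]) -> float:
    """Return the fraction of tokens in `text` that appear in `known_words`.

    Naive lowercase regex tokenization; swap in proper tokenization
    and lemmatization for a production system.
    """
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in known_words)
    return known / len(tokens)

# Example: 6 of the 7 tokens are in the learner's vocabulary.
vocab = {"the", "cat", "sat", "on", "mat"}
frac = known_word_fraction("The cat sat on the small mat", vocab)
```

You would run this once over the corpus, store `frac` on each document, and recompute it only when the document (or the user's word set) changes.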

To improve performance when finding documents with a high percentage, create an index on the calculated value.

To find them even faster, save the ids of high-percentage documents in a separate dedicated table and retrieve each document by its id. Keeping the extra table up to date when documents change is more work, but reads will be fast.
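The index and the dedicated-table approach can be sketched together with SQLite (an illustrative schema only; the table and column names here are assumptions, not part of the original answer):

```python
import sqlite3

# In-memory sketch; a real deployment would use its own database and schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        body TEXT,
        known_pct REAL          -- precomputed known-word percentage
    );
    -- Index so range queries on the percentage avoid a full scan.
    CREATE INDEX idx_docs_known_pct ON documents(known_pct);

    -- Dedicated table of documents already in the target band,
    -- refreshed whenever a document is inserted or updated.
    CREATE TABLE target_band_docs (doc_id INTEGER PRIMARY KEY);
""")

conn.executemany(
    "INSERT INTO documents (id, body, known_pct) VALUES (?, ?, ?)",
    [(1, "easy text", 0.99), (2, "just right", 0.93), (3, "too hard", 0.60)],
)
# Populate the dedicated table from the indexed column.
conn.execute(
    "INSERT INTO target_band_docs "
    "SELECT id FROM documents WHERE known_pct BETWEEN 0.90 AND 0.95"
)

# Reads become a cheap primary-key join.
rows = conn.execute(
    "SELECT d.id FROM target_band_docs t JOIN documents d ON d.id = t.doc_id"
).fetchall()
```

The trade-off is exactly as described above: writes must maintain `target_band_docs`, but the read path touches only ids.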

If the content never changes, maintaining and using the metadata is far simpler.

Upvotes: 0
