Victor Ribeiro

Reputation: 628

How to find text similarity within millions of entries?

Having used spaCy to find similarity across a few texts, I'm now trying to find similar texts among millions of entries, near-instantaneously.

I have an app with millions of texts and I'd like to present the user with similar texts when they ask for them.

How do sites like StackOverflow find similar questions so fast?

I can imagine 2 approaches:

  1. Each time a text is inserted, it is compared against the entire DB and a link is created between both texts (in an intermediate table with both foreign keys).
  2. Each time a text is inserted, its vector is stored in a field associated with that text. Whenever a user asks for similar texts, the app searches the DB for similar vectors.

My doubt is with the second option: is storing the word vector enough to search quickly for similar texts? A naive version of that approach is sketched below.
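Just to illustrate what I mean, a brute-force version of option 2 with spaCy vectors might look roughly like this (the model name and texts are only placeholders; with millions of rows this full scan seems too slow):

    # Naive sketch of option 2: store each text's vector, then compare a new
    # query against all of them. Model and data are illustrative assumptions.
    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_md")  # medium model includes word vectors

    texts = ["first stored text", "second stored text"]  # imagine millions
    vectors = np.array([nlp(t).vector for t in texts])   # stored per row

    query = nlp("text the user is currently reading").vector

    # Cosine similarity of the query against every stored vector (full scan)
    scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    best = scores.argsort()[::-1][:5]  # indices of the most similar texts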

Upvotes: 1

Views: 895

Answers (2)

Thomas Kimber

Reputation: 11067

You want a function that can quickly map a text into a multi-dimensional space. Your collection of documents should be indexed with respect to that space, so that you can quickly find the shortest-distance match between your text and those in the space.

Algorithms exist that will speed up that indexing process - but it could be as simple as sub-indexing the space into shards or blocks on a less granular basis and narrowing down the search that way.

One simple way of defining such a space is term frequency (TF) or term frequency-inverse document frequency (TFIDF) - but without a limit on your vocabulary size these can suffer from space/accuracy issues. Still, with a vocabulary of the 100 most specific words in a corpus, you should be able to get a reasonable indication of similarity that scales to millions of results. It depends on your corpus.
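As a minimal sketch of that idea, assuming scikit-learn is available (the corpus, vocabulary cap and neighbour count are just examples):

    # TF-IDF vectors with a capped vocabulary, plus a nearest-neighbour index
    # on cosine distance. Corpus and parameters are illustrative only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    corpus = [
        "how to join two tables in sql",
        "joining tables with an inner join in sql",
        "reverse a linked list in python",
        "sorting a list of tuples in python",
    ]

    # Cap the vocabulary so the space stays small and scales to large corpora
    vectorizer = TfidfVectorizer(max_features=100, stop_words="english")
    X = vectorizer.fit_transform(corpus)

    # Index the documents once; queries then only need a vector lookup
    index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)

    query = vectorizer.transform(["sql inner join on two tables"])
    distances, ids = index.kneighbors(query)  # ids of the closest documents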

There are plenty of alternative features you might consider - but all of them will resolve to having a reliable method of transforming your document into a geometric vector, which you can then interrogate for similarity.

Upvotes: 1

ohlr

Reputation: 1909

Comparing all the texts every time a new request comes in is infeasible.

To be really fast on large datasets I can recommend locality-sensitive hashing (LSH). It gives you entries that are similar with high probability, and it significantly reduces the complexity of your algorithm.

However, you have to train your algorithm once - that may take time - but after that it's very fast.
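A minimal sketch of how this could look in Python, assuming the `datasketch` library (the tokenization, threshold and number of permutations are just example choices):

    # MinHash-based LSH: build the index once, then queries return candidate
    # keys that are similar with high probability. Data here is illustrative.
    from datasketch import MinHash, MinHashLSH

    def minhash(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in text.lower().split():
            m.update(token.encode("utf8"))
        return m

    texts = {
        "q1": "how to join two tables in sql",
        "q2": "joining two sql tables",
        "q3": "reverse a linked list in python",
    }

    # Build the index once ("training"); later inserts and queries are cheap
    lsh = MinHashLSH(threshold=0.3, num_perm=128)
    for key, text in texts.items():
        lsh.insert(key, minhash(text))

    # Returns keys whose texts are likely similar to the query
    print(lsh.query(minhash("how do I join two tables in sql")))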

https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134
https://en.wikipedia.org/wiki/Locality-sensitive_hashing

Here is a tutorial that seems close to your application: https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/

Upvotes: 1
