Using BERT and cosine similarity to identify similar documents

Question

We have a news website where we have to match news articles to a particular user.

For the matching we can use only the user's textual information, for example the user's interests or a brief description of them.

I was thinking of treating both the user's textual information and the news text as documents and computing document similarity between them.

In this way, I hope that if my profile contains a sentence like "I loved the speech of the president in Chicago last year" and a news item says "Trump is going to speak in Illinois", I can get a match (the example is purely arbitrary).

I first tried embedding my documents with TF-IDF and then ran k-means on the vectors to see whether the clusters made sense, but I don't like the results very much.

I think the problem derives from the poor embedding that TF-IDF gives me.
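
To show what I mean by a poor embedding, here is a minimal pure-Python sketch of TF-IDF plus cosine similarity applied to the two example sentences. It is not my actual pipeline: the tokenizer, the smoothed IDF formula, and the helper names (`tokenize`, `tfidf_vectors`, `cosine`) are assumptions made only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, split on whitespace,
    # strip surrounding punctuation.
    tokens = (w.strip(".,:;!?\"'").lower() for w in text.split())
    return [t for t in tokens if t]


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict: token -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (one common variant, chosen here as an assumption).
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


profile = "I loved the speech of the president in Chicago last year"
news = "Trump is going to speak in Illinois"

vecs = tfidf_vectors([profile, news])
score = cosine(vecs[0], vecs[1])
print(score)
```

The only token the two sentences share is "in", so the score stays close to zero: TF-IDF sees "speech" and "speak", or "Chicago" and "Illinois", as unrelated dimensions.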

Thus I was thinking of using BERT to compute embeddings of my documents and then using cosine similarity to measure how similar two documents are (one describing the user profile, the other a news item).
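
Roughly, the idea could be sketched as follows. This is only a sketch under several assumptions: averaging sentence vectors into a document vector is a naive choice I have not validated, the sentence splitter is a simple regex, and `load_bert_encoder` assumes the `sentence-transformers` package with a BERT-style model such as `all-MiniLM-L6-v2`. All function names here are hypothetical.

```python
import math
import re


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def mean_pool(vectors):
    # Average a list of equal-length vectors into a single vector.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    # Cosine of the angle between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def embed_document(text, encode_sentences):
    # encode_sentences: any callable mapping a list of strings to one
    # vector per string (for example the encoder returned below).
    sentences = split_sentences(text)
    if not sentences:
        raise ValueError("cannot embed an empty document")
    vectors = [[float(x) for x in vec] for vec in encode_sentences(sentences)]
    return mean_pool(vectors)


def load_bert_encoder(model_name="all-MiniLM-L6-v2"):
    # Assumption: sentence-transformers is installed and the model can be
    # downloaded. Imported lazily so the helpers above work without it.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode
```

With the package installed, a profile and a news item would be compared with `encoder = load_bert_encoder()` followed by `cosine_similarity(embed_document(profile_text, encoder), embed_document(news_text, encoder))`, where `profile_text` and `news_text` are the two documents.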

Is this an approach that could make sense? BERT can be used to retrieve embeddings of sentences, but is there a way to embed an entire document?

What would you advise?

Thank you
