Reputation: 925
We have a news website where we need to match news articles to a particular user.
For the matching we can use only the user's textual information, for example the user's interests or a brief description of them.
I was thinking of treating both the user's textual information and the news text as documents and computing document similarity.
That way, I hope, if my profile contains a sentence like "I loved the speech of the president in Chicago last year" and a news article says "Trump is going to speak in Illinois", I can get a match (the example is purely arbitrary).
First I tried embedding my documents with TF-IDF and then running k-means to see whether the clusters made sense, but I don't like the results very much.
I think the problem comes from the poor embedding that TF-IDF gives me.
So I was thinking of using BERT to get embeddings of my documents and then using cosine similarity to check the similarity of two documents (a user-profile document and a news article).
Is this an approach that could make sense? BERT can be used to get sentence embeddings, but is there a way to embed an entire document?
What would you advise?
Thank you
Upvotes: 2
Views: 3100
Reputation: 11258
BERT is trained on pairs of sentences, so it is unlikely to generalize to much longer texts. In addition, BERT's memory use grows quadratically with the length of the text, so very long texts can cause memory issues. In most implementations, it does not accept sequences longer than 512 subwords.
Making pre-trained Transformers work efficiently on long texts is an active research area. You can have a look at a paper called DocBERT to get an idea of what people are trying, but it will take some time until there is a nicely packaged working solution.
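If you still want to experiment with BERT despite the 512-subword limit, one workaround people commonly try is to split the document into overlapping windows, embed each window, and average the results. The sketch below only shows the windowing and averaging; `embed_window` is a hypothetical placeholder for whatever model call you use, and the stride of 256 is an arbitrary assumption, not a recommended value.

```python
from typing import Callable, List, Sequence

MAX_LEN = 512  # typical BERT limit, counted in subword tokens


def split_into_windows(tokens: Sequence[str],
                       max_len: int = MAX_LEN,
                       stride: int = 256) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most max_len tokens."""
    if max_len <= 0 or stride <= 0 or stride > max_len:
        raise ValueError("need 0 < stride <= max_len")
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the text
        start += stride
    return windows


def embed_document(tokens: Sequence[str],
                   embed_window: Callable[[List[str]], List[float]],
                   max_len: int = MAX_LEN,
                   stride: int = 256) -> List[float]:
    """Average the per-window vectors into a single document vector.

    embed_window is a hypothetical stand-in for the real model call;
    it must return vectors of the same length for every window.
    """
    vectors = [embed_window(w) for w in split_into_windows(tokens, max_len, stride)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

Averaging treats every window as equally important and discards word order across windows, so this is a baseline to compare against TF-IDF rather than a solution to the long-document problem.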
There are also other methods for document embedding; for instance, Gensim implements doc2vec. However, I would still stick with TF-IDF.
TF-IDF is typically very sensitive to data pre-processing. You certainly need to remove stopwords, and in many languages it also pays off to lemmatize. Given the specific domain of your texts, you can also try extending the standard stopword list with words that appear frequently in news stories. You can get further improvements by detecting named entities and keeping each one together as a single token.
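To make the pre-processing point concrete, here is a minimal standard-library sketch of TF-IDF with stopword removal and cosine similarity between a profile and a few news items. The stopword list, the regex tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration; it does no lemmatization or named-entity handling, and in practice you would use a library implementation.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; "said" and "reported" stand in for
# news-specific additions. A real list would be much longer.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "is", "and", "i", "on",
             "for", "said", "reported"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumed variant) so no term gets zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: count * idf[t] for t, count in Counter(toks).items()}
            for toks in tokenized]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_news(profile, news):
    """Return (score, news_index) pairs, best match first."""
    vectors = tfidf_vectors([profile] + list(news))
    scores = [(cosine(vectors[0], v), i) for i, v in enumerate(vectors[1:])]
    return sorted(scores, reverse=True)


profile = "I loved the speech of the president in Chicago"
news = ["The president gave a speech in Illinois",
        "Stock markets fell on Monday"]
for score, idx in rank_news(profile, news):
    print(round(score, 3), news[idx])
```

Note that the first news item matches only through "speech" and "president". TF-IDF cannot relate "Chicago" to "Illinois", so how much pre-processing helps depends on how much vocabulary your profiles and articles actually share.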
Upvotes: 2