Reputation: 19
I want to create a corpus for a machine learning task. I have a small text dataset and want to crawl similar sentences from the web. I used the sentence_transformers package with a pretrained BERT model, doc2vec, and spaCy similarity to measure similarity. I set the threshold to 85%, but the sentences that scored above the threshold weren't really relevant. How can I crawl similar sentences from the web in Python?
Upvotes: 1
Views: 160
Reputation: 2358
I think you should train a large model on a large corpus and then use that model to generate random sentences. The gensim library provides several corpora that you can use to find similar sentences or to train a model that generates similar sentences. Here is how to do it.
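The answer's original example is not preserved here, so what follows is only a rough sketch of the "find similar sentences" step, not the answerer's method. It swaps in plain TF-IDF cosine similarity using the standard library alone, so that it runs without gensim or any downloaded corpus. The function names (`build_idf`, `rank_candidates`, and so on) and the sample sentences are made up for illustration. Instead of a fixed 85% threshold, it ranks candidates against the seed sentences and keeps the top few, which may be easier to inspect by hand.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(documents):
    """Smoothed inverse document frequency for every token in the documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {
        tok: math.log((1 + n_docs) / (1 + df)) + 1.0
        for tok, df in doc_freq.items()
    }


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector stored as a dict; unknown tokens are skipped."""
    counts = Counter(tokenize(text))
    return {tok: cnt * idf[tok] for tok, cnt in counts.items() if tok in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


def rank_candidates(seeds, candidates, top_k=3):
    """Score each candidate by its best match among the seeds, best first."""
    idf = build_idf(list(seeds) + list(candidates))
    seed_vecs = [tfidf_vector(s, idf) for s in seeds]
    scored = []
    for cand in candidates:
        cand_vec = tfidf_vector(cand, idf)
        score = max((cosine(cand_vec, sv) for sv in seed_vecs), default=0.0)
        scored.append((score, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical data: seeds stand in for the small dataset,
    # candidates stand in for sentences crawled from the web.
    seeds = ["the cat sat on the mat"]
    candidates = [
        "a cat sat on a mat",
        "stock prices fell sharply today",
        "the cat sat on the mat",
    ]
    for score, sentence in rank_candidates(seeds, candidates):
        print(f"{score:.3f}  {sentence}")
```

Word-overlap scoring like this misses paraphrases, so it is probably best treated as a cheap first filter; the same ranking loop should work with embeddings from gensim or sentence_transformers in place of `tfidf_vector`.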
Upvotes: 1