Reputation: 4512
my goal is very simple: I have a set of strings or a sentence and I want to find the most similar one within a text corpus.
For example I have the following text corpus: "The front of the library is adorned with the Word of Life mural designed by artist Millard Sheets."
And I'd like to find the substring of the original corpus which is most similar to: "the library facade is painted"
So what I should get as output is: "fhe front of the library is adorned"
The only thing I came up with is to split the original sentence in substrings of variable lengths (eg. in substrings of 3,4,5 strings) and then use something like string.similarity(substring)
from the spacy
python module to assess the similarities of my target text with all the substrings and then keep the one with the highest value.
It seems a pretty inefficient method. Is there anything better I can do?
Upvotes: 1
Views: 1447
Reputation: 11474
It probably works to some degree, but I wouldn't expect the spacy similarity method (averaging word vectors) to work particularly well.
The task you're working on is related to paraphrase detection/identification and semantic textual similarity and there is a lot of existing work. It is frequently used for things like plagiarism detection and the evaluation of machine translation systems, so you might find more approaches by looking in those areas, too.
If you want something that works fairly quickly out of the box for English, one suggestion is terp, which was developed for MT evaluation but shown to work well for paraphrase detection:
https://github.com/snover/terp
Most methods are set up to compare two sentences, so this doesn't address your potential partial sentence matches. Maybe it would make sense to find the most similar sentence and then look for substrings within that sentence that match better than the sentence as a whole?
Upvotes: 1