Reputation: 399
I have one paragraph of text (a vector of words) and I would like to check whether it is "part" of a longer text (also a vector of words). However, I know that this paragraph does not appear in the text in its exact form, but with slight changes: a few words could be missing, the order could be slightly different, some words could be inserted as parenthetical elements, etc.
I am currently implementing solutions "by hand", such as checking whether most of the words of the paragraph are in the text, looking at the distance between these words, their order, etc. I was wondering, however, whether there is a built-in method to do that?
I already checked the tm package, but it does not seem to do that.
Any idea?
Upvotes: 0
Views: 153
Reputation: 1306
I fear that you are stuck with hand-writing an approach, e.g. grep-ing some word groups and applying some kind of matching threshold.
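To make the "matching threshold" idea concrete, here is a minimal base-R sketch. It slides a window over the long text and scores each window by the share of distinct paragraph words it contains. The function name `find_fuzzy_paragraph` and the `slack` and `threshold` parameters are made up for illustration, and the default values are guesses that would need tuning on real data.

```r
# Hypothetical helper (not from any package): score every window of the long
# text by the share of distinct paragraph words it contains.
find_fuzzy_paragraph <- function(paragraph, text, slack = 1.5, threshold = 0.7) {
  stopifnot(length(paragraph) > 0, length(text) > 0)
  para <- unique(tolower(paragraph))
  txt  <- tolower(text)
  # The window is wider than the paragraph so inserted words still fit in it
  width  <- min(length(txt), ceiling(length(paragraph) * slack))
  starts <- seq_len(length(txt) - width + 1)
  # Fraction of distinct paragraph words found in each window (order is ignored)
  scores <- vapply(starts, function(i) {
    mean(para %in% txt[i:(i + width - 1)])
  }, numeric(1))
  best <- which.max(scores)  # first window with the highest score
  list(start = best, end = best + width - 1,
       score = scores[best], found = scores[best] >= threshold)
}

paragraph <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
text <- c("once", "upon", "a", "time", "the", "brown", "quick", "fox", "(a", "red",
          "one)", "jumps", "over", "lazy", "dog", "and", "then", "it", "ran", "away")
res <- find_fuzzy_paragraph(paragraph, text)
res$found
```

This only checks word overlap, so it tolerates reordering, missing words and insertions, but it says nothing about how scrambled the order is. If misspelled words also matter, base R's `agrep()` and `adist()` do approximate string matching and could replace the exact `%in%` test.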
Upvotes: 1