Epiphastro
Epiphastro

Reputation: 71

Find identical phrases in strings

I have sets of texts. The texts are rather short (up to 500 characters). The sets consist of up to 10 texts. I try to find phrases as long as possible which occur in most of the texts. In other words I am looking for identical substrings. The longer the better and the more texts they occur in also the better.

Example:

Phrases (one word phrases ommitted):

Phrases like "brown fox" or "fox jumps" are omitted, because they are subphrases of longer phrases.

I am looking for an algorithm to find those phrases.

Upvotes: 1

Views: 48

Answers (2)

גלעד ברקן
גלעד ברקן

Reputation: 23955

As every substring is a prefix of some suffix, traversing a generalised suffix tree ought to let us compare paths both by how many leaves from different texts they share, indicating quantity of texts sharing a substring, as well as how long the distances to the leaves, indicating shared substring lengths.

Upvotes: 1

Timothée Bronner
Timothée Bronner

Reputation: 26

Finding the longuest commun substring is a commun DP algorithm explained pretty well here. https://www.geeksforgeeks.org/longest-common-substring-dp-29/ . After to find the occurence of the strings in the set of text you can simply use a code loke this.

substring = "red brown fox"
n = 0
for text in texts:
    if substring in text:
       n = n + 1
print(substring, n)

Upvotes: 1

Related Questions