Trepik

Reputation: 41

Getting percentage of similarity of two texts

I need a similarity score between two texts when one of them is contained in the other.

For example:

Text1: aaa bbb ccc ddd eee
Text2: bbb ccc

I need something that tells me that Text2 is 100% inside Text1. Is there a way to do this?

Upvotes: 4

Views: 629

Answers (3)

Yuval F

Reputation: 20621

Please see the book Mining of Massive Datasets and Dekang Lin's definition of similarity (PDF). Neither requires Lucene.

Upvotes: 0

Mikos

Reputation: 8553

You don't need Lucene to obtain the similarity between texts. Several measures are available, and which one works best depends on the text length, the type of strings, and so on, so you will need to experiment to see which gives you the best results.

A pretty good and comprehensive collection of algorithms is available in SimMetrics, an F/OSS library that offers an extensive collection of similarity algorithms and their corresponding cost functions.

Upvotes: 0

Howard

Reputation: 39207

Depending on what you want, you may try:

  • the length of the longest common subsequence of both texts divided by the length of text2
  • or the length of the longest common contiguous subsequence (the longest common substring) of both texts, also divided by the length of text2

Both will give you 1 if text2 is completely inside text1 and 0 if the two texts do not share a single character.
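
As a rough illustration, here is a minimal Python sketch of both measures. It assumes a character-level comparison (comparing word by word would be an equally valid choice), and the function names `lcs_subsequence_length`, `longest_common_substring_length` and `containment_score` are made up for this example, not taken from any library.

```python
def lcs_subsequence_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (characters need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence that ended before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two neighbouring cells.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest common contiguous run of characters."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous character pair.
                curr.append(prev[j - 1] + 1)
                best = max(best, curr[-1])
            else:
                # A mismatch breaks the contiguous run.
                curr.append(0)
        prev = curr
    return best


def containment_score(text1: str, text2: str, contiguous: bool = False) -> float:
    """Share of text2 (between 0.0 and 1.0) that is found inside text1."""
    if not text2:
        return 0.0  # Assumption: an empty text2 scores 0 rather than raising an error.
    measure = longest_common_substring_length if contiguous else lcs_subsequence_length
    return measure(text1, text2) / len(text2)


if __name__ == "__main__":
    text1 = "aaa bbb ccc ddd eee"
    print(containment_score(text1, "bbb ccc"))                   # subsequence variant
    print(containment_score(text1, "bbb ccc", contiguous=True))  # contiguous variant
    print(containment_score(text1, "bbb ddd", contiguous=True))  # only partly contiguous
```

The two variants differ when the pieces of text2 appear in text1 in order but with something in between: the subsequence variant still reports a full match, while the contiguous variant reports only the longest unbroken piece. Both run in time proportional to len(text1) * len(text2), so for very long texts a different approach may be needed.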

Upvotes: 1
