Reputation: 15716
Suppose I have 4 blocks of text:
1.
Premium 95 950,034
950,03
158,34
NUMERAR: 1
REST
2.
Premium 15 950,034
111,03 aaaaa
158,34
NUMERAR: 1
REST
3.
Premium 95 950,034
950,03 bbbbb
158,34
dddddd
fffff
4.
PremiR 95 950,034
950,03
158,34
NUMERAR: 1
REST A
As you can see, these blocks differ from each other. Some of them match closely: blocks 1 and 4. Others match the least: blocks 2 and 3.
Is there an algorithm (in Java/Kotlin) that finds the most closely matching text blocks? In this example: 1 and 4.
How many words match in each block?
P.S. Maybe the Levenshtein distance can help.
Upvotes: 0
Views: 421
Reputation: 109597
You should search for correlation.
The following is not primarily about correlation, but it is a straightforward, step-wise approach that simplifies the data:
Convert every block to a sequence of words, represented as word IDs, and use the Levenshtein distance to measure the difference between every two sequences.
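A minimal Java sketch of that step, under a few assumptions: words are split on whitespace, every insertion, deletion, or substitution of a whole word costs 1, and all pairs are compared by brute force. The class name `WordLevenshtein` and its methods are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class WordLevenshtein {
    // The four sample blocks from the question (index 0 = block 1).
    static final String[] SAMPLE = {
        "Premium 95 950,034\n950,03\n158,34\nNUMERAR: 1\nREST",
        "Premium 15 950,034\n111,03 aaaaa\n158,34\nNUMERAR: 1\nREST",
        "Premium 95 950,034\n950,03 bbbbb\n158,34\ndddddd\nfffff",
        "PremiR 95 950,034\n950,03\n158,34\nNUMERAR: 1\nREST A"
    };

    // Shared dictionary: each distinct word gets a small integer ID.
    private final Map<String, Integer> ids = new HashMap<>();

    // Assumption: a "word" is any run of non-whitespace characters.
    int[] encode(String block) {
        String[] words = block.trim().split("\\s+");
        int[] seq = new int[words.length];
        for (int i = 0; i < words.length; i++) {
            Integer id = ids.get(words[i]);
            if (id == null) {
                id = ids.size();
                ids.put(words[i], id);
            }
            seq[i] = id;
        }
        return seq;
    }

    // Classic two-row Levenshtein distance, applied to word IDs
    // instead of characters.
    static int distance(int[] a, int[] b) {
        int[] prev = new int[b.length + 1];
        int[] curr = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) prev[j] = j;
        for (int i = 1; i <= a.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length];
    }

    // Brute force over all pairs; returns 0-based indices of the first
    // pair with the smallest distance.
    static int[] closestPair(String[] blocks) {
        WordLevenshtein encoder = new WordLevenshtein();
        int[][] seqs = new int[blocks.length][];
        for (int i = 0; i < blocks.length; i++) seqs[i] = encoder.encode(blocks[i]);
        int best = Integer.MAX_VALUE;
        int[] pair = {-1, -1};
        for (int i = 0; i < seqs.length; i++) {
            for (int j = i + 1; j < seqs.length; j++) {
                int d = distance(seqs[i], seqs[j]);
                if (d < best) { best = d; pair = new int[] {i, j}; }
            }
        }
        return pair;
    }

    public static void main(String[] args) {
        int[] p = closestPair(SAMPLE);
        // prints "Closest blocks: 1 and 4"
        System.out.println("Closest blocks: " + (p[0] + 1) + " and " + (p[1] + 1));
    }
}
```

For the sample data, blocks 1 and 4 end up 2 word edits apart (`Premium` replaced by `PremiR`, and a trailing `A` added), which is the smallest distance of any pair. If blocks differ a lot in length, dividing the distance by the longer sequence length may give a fairer comparison.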
You could index the blocks by n-grams (subsequences of, say, n = 3 words), thus reducing the number of combinations to compare.
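One way the n-gram index could look in Java, as a rough sketch: every word trigram maps to the set of blocks containing it, and only blocks that share at least one trigram become candidate pairs for the more expensive distance computation. The class `NGramIndex` and its method names are hypothetical, and whitespace tokenization is an assumption. Note that this is a heuristic filter: two blocks with no trigram in common are never compared, even if they share isolated words.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class NGramIndex {
    // The four sample blocks from the question (index 0 = block 1).
    static final String[] SAMPLE = {
        "Premium 95 950,034\n950,03\n158,34\nNUMERAR: 1\nREST",
        "Premium 15 950,034\n111,03 aaaaa\n158,34\nNUMERAR: 1\nREST",
        "Premium 95 950,034\n950,03 bbbbb\n158,34\ndddddd\nfffff",
        "PremiR 95 950,034\n950,03\n158,34\nNUMERAR: 1\nREST A"
    };

    // Maps each word n-gram (words joined by a single space) to the
    // indices of the blocks that contain it.
    static Map<String, Set<Integer>> build(String[] blocks, int n) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int b = 0; b < blocks.length; b++) {
            List<String> words = Arrays.asList(blocks[b].trim().split("\\s+"));
            for (int i = 0; i + n <= words.size(); i++) {
                String gram = String.join(" ", words.subList(i, i + n));
                index.computeIfAbsent(gram, k -> new TreeSet<>()).add(b);
            }
        }
        return index;
    }

    // Collects every pair of blocks that shares at least one n-gram,
    // encoded as "i-j" with i < j (0-based block indices).
    static Set<String> candidatePairs(Map<String, Set<Integer>> index) {
        Set<String> pairs = new TreeSet<>();
        for (Set<Integer> posting : index.values()) {
            Integer[] ids = posting.toArray(new Integer[0]);
            for (int i = 0; i < ids.length; i++) {
                for (int j = i + 1; j < ids.length; j++) {
                    pairs.add(ids[i] + "-" + ids[j]);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Set<String> pairs = candidatePairs(build(SAMPLE, 3));
        // prints "[0-1, 0-2, 0-3, 1-3, 2-3]"
        System.out.println(pairs);
    }
}
```

With the sample data, blocks 2 and 3 share no word trigram, so the pair `1-2` (0-based) drops out and 5 of the 6 possible pairs remain. The saving is small here, but it grows with the number of blocks, since unrelated blocks are never paired at all.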
Upvotes: 2