Alexei

Reputation: 15716

Algorithm (in Java/Kotlin) that finds the most closely matching text blocks

Suppose I have 4 blocks of text:

1.

Premium 95 950,034
950,03
158,34
NUMERAR: 1
REST



2.

Premium 15 950,034
111,03 aaaaa
158,34
NUMERAR: 1
REST


3.

Premium 95 950,034
950,03 bbbbb
158,34
dddddd
fffff


4.

PremiR 95 950,034
950,03
158,34
NUMERAR: 1
REST A

As you can see, these blocks differ from each other. Some match closely: blocks 1 and 4. Others match least: blocks 2 and 3.

Is there an algorithm (in Java/Kotlin) that finds the most closely matching text blocks? In this example, that would be blocks 1 and 4.

How many words match between each pair of blocks?


P.S. Maybe the Levenshtein distance can help.

Upvotes: 0

Views: 421

Answers (1)

Joop Eggen

Reputation: 109597

You should search for correlation.

The following is not primarily about correlation, but it is a straightforward, stepwise approach that simplifies the data:

Convert every block to a sequence of words (word IDs), and use the Levenshtein distance to measure the difference between every two sequences. The drawbacks:

  • Slow, quadratic O(N²).
  • Does not respect structured data (title, number X, number Y).
  • Does not recognize similar words such as Premium/PremiR; they count as entirely different words.
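The word-sequence comparison above can be sketched as follows. This is only a minimal illustration, not a tuned implementation: the class name `BlockSimilarity`, the whitespace tokenization, and comparing the word strings directly (instead of mapping them to integer IDs first) are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

final class BlockSimilarity {

    // The four blocks from the question, one string per block.
    static final List<String> SAMPLE = Arrays.asList(
        "Premium 95 950,034\n950,03\n158,34\nNUMERAR: 1\nREST",
        "Premium 15 950,034\n111,03 aaaaa\n158,34\nNUMERAR: 1\nREST",
        "Premium 95 950,034\n950,03 bbbbb\n158,34\ndddddd\nfffff",
        "PremiR 95 950,034\n950,03\n158,34\nNUMERAR: 1\nREST A");

    /** Splits a block into words on runs of whitespace (assumed tokenization). */
    static List<String> tokenize(String block) {
        return Arrays.asList(block.trim().split("\\s+"));
    }

    /** Levenshtein distance where one edit is a whole word, not a character. */
    static int wordDistance(List<String> a, List<String> b) {
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) {
            prev[j] = j; // turning an empty sequence into b[0..j) costs j inserts
        }
        for (int i = 1; i <= a.size(); i++) {
            curr[0] = i; // turning a[0..i) into an empty sequence costs i deletes
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.size()];
    }

    /** Compares every pair of blocks; returns {i, j, distance} of the closest pair. */
    static int[] closestPair(List<String> blocks) {
        int[] best = {-1, -1, Integer.MAX_VALUE};
        for (int i = 0; i < blocks.size(); i++) {
            List<String> wordsI = tokenize(blocks.get(i));
            for (int j = i + 1; j < blocks.size(); j++) {
                int d = wordDistance(wordsI, tokenize(blocks.get(j)));
                if (d < best[2]) {
                    best = new int[] {i, j, d};
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] best = closestPair(SAMPLE);
        // Indices are zero-based, so add 1 to get the block numbers from the question.
        System.out.println("closest: blocks " + (best[0] + 1) + " and " + (best[1] + 1)
            + ", distance " + best[2]);
    }
}
```

On the four sample blocks this picks blocks 1 and 4, at a distance of 2 word edits (Premium/PremiR substituted, the trailing A inserted). It also shows the last drawback in the list: Premium and PremiR cost a full edit even though they differ by only two characters.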

You could index the blocks by n-grams (subsequences of, say, n = 3 words), thus reducing the number of combinations to compare.
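A rough sketch of such an n-gram index, under my own assumptions: the class name `NgramIndex`, whitespace tokenization, and "shares at least one n-gram" as the criterion for a candidate pair are illustrative choices, not a fixed recipe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NgramIndex {

    // The four blocks from the question, one string per block.
    static final List<String> SAMPLE = Arrays.asList(
        "Premium 95 950,034\n950,03\n158,34\nNUMERAR: 1\nREST",
        "Premium 15 950,034\n111,03 aaaaa\n158,34\nNUMERAR: 1\nREST",
        "Premium 95 950,034\n950,03 bbbbb\n158,34\ndddddd\nfffff",
        "PremiR 95 950,034\n950,03\n158,34\nNUMERAR: 1\nREST A");

    /** Distinct n-grams of a block: every run of n consecutive words. */
    static Set<String> ngrams(String block, int n) {
        String[] words = block.trim().split("\\s+");
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }

    /**
     * Maps each pair {i, j} (i < j) of block indices that share at least one
     * n-gram to the number of distinct n-grams they share. Pairs that share
     * none never appear, so they can be skipped by any costlier comparison.
     */
    static Map<List<Integer>, Integer> sharedCounts(List<String> blocks, int n) {
        // Inverted index: n-gram -> indices of the blocks that contain it.
        Map<String, List<Integer>> index = new HashMap<>();
        for (int b = 0; b < blocks.size(); b++) {
            for (String gram : ngrams(blocks.get(b), n)) {
                index.computeIfAbsent(gram, k -> new ArrayList<>()).add(b);
            }
        }
        // Each posting list is in ascending block order, so pairs come out as i < j.
        Map<List<Integer>, Integer> counts = new HashMap<>();
        for (List<Integer> posting : index.values()) {
            for (int x = 0; x < posting.size(); x++) {
                for (int y = x + 1; y < posting.size(); y++) {
                    counts.merge(Arrays.asList(posting.get(x), posting.get(y)), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Keys are zero-based block indices; values are shared 3-gram counts.
        sharedCounts(SAMPLE, 3).forEach((pair, count) ->
            System.out.println(pair + " share " + count + " 3-grams"));
    }
}
```

With n = 3 on the sample, blocks 1 and 4 share five 3-grams, more than any other pair, while blocks 2 and 3 share none and drop out as candidates. On four blocks the saving is trivial; the point is that with many blocks, only pairs that turn up in the index need the full distance computation.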

Upvotes: 2
