Félix Saparelli

Reputation: 8729

Tool or technique to compare and group diffs by similarity

I have developed a system that allows visitors to submit typo corrections for my blog. It works by having a small client-side app which then sends unified diffs to a server. Behind that, I have an interface which allows me to see all diffs in a nice graphical way, sort them, etc.

However, I expect that over time many visitors will submit corrections for the same errors before I have time to fix them. So I need a way to group similar or identical diffs together.

Identical diffs are easy enough. But there might be people who fix errors differently, e.g. using American or British spellings, different rules for punctuation, varying understandings of unclear phrases, that kind of thing. Grouping similar diffs would be tremendously helpful.

Are there techniques, algorithms, or tools that are specifically designed or can be used to compute the similarity of diffs?

Upvotes: 2

Views: 80

Answers (2)

Kirill Gamazkov

Reputation: 3397

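Maybe you could adopt the Damerau-Levenshtein algorithm. It computes the edit distance between two strings, counting insertions, deletions, substitutions, and transpositions of adjacent characters, so a small distance between two replacement texts suggests the diffs are similar.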
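As a rough illustration, here is a minimal sketch of the restricted variant of that distance (often called optimal string alignment), which is simpler than full Damerau-Levenshtein and usually good enough for short typo fixes. The function names `osa_distance` and `similarity` are my own, and the normalization to a 0..1 score is an assumption about how you might want to threshold the result, not part of the algorithm itself.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: scale the distance to 1.0 (identical) .. 0.0."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - osa_distance(a, b) / longest
```

You would run this on the replacement text of two diffs that touch the same place, and treat them as "similar" above some threshold that you tune by hand.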

Upvotes: 0

armel

Reputation: 2580

I believe that you have two problems to solve: 1. recognizing fixes that target the same text (e.g. the same typo location), and 2. grouping all the patches related to that location, possibly discarding those whose solutions are identical or nearly identical.

Problem 1. The unified diff format is somewhat OK since it gives you the affected lines, but a word-level or character-level diff (for example, treating each word as a line, as wdiff does) would pinpoint the change more precisely and help you group the patches more accurately.
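To give an idea of what a word-level comparison could look like, here is a small sketch using `difflib.SequenceMatcher` from the Python standard library on lists of words. It assumes that every submission was made against the same version of the original text and that you can reconstruct the corrected text from each diff; `word_changes` and `group_by_location` are names I made up for the example.

```python
import difflib
from collections import defaultdict


def word_changes(original: str, corrected: str):
    """Return (start, end, replacement) for each changed run of words.

    start/end are word indexes into the original text.
    """
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((i1, i2, " ".join(b[j1:j2])))
    return changes


def group_by_location(original: str, submissions):
    """Group the proposed replacements by the word span they modify."""
    groups = defaultdict(list)
    for corrected in submissions:
        for start, end, replacement in word_changes(original, corrected):
            groups[(start, end)].append(replacement)
    return dict(groups)
```

For instance, `group_by_location("The colr of the sky", ["The color of the sky", "The colour of the sky"])` puts both proposals under the same span `(1, 2)`, so you see at a glance that two visitors fixed the same word differently. Grouping on exact spans is the simplest choice; patches with overlapping but unequal spans would need an extra merging step.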

Problem 2. If the patches are identical, it is trivial, as you noted. If they are different, solving problem 1 has already done much of the work. You can of course apply a normalization such as removing inflected word endings ('s', 'ing' and so on at the end of words, for example) or lower-casing before comparing the replacement parts of the unified diffs, which helps group nearly identical solutions together.
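A very crude version of that normalization might look like the sketch below. The suffix list and the minimum stem length are arbitrary assumptions, and stripping surrounding punctuation is an extra step I added for the example; a real stemmer would do a better job than this.

```python
import string

# Assumed, deliberately tiny suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "es", "s")
MIN_STEM = 3  # do not strip if fewer than 3 characters would remain


def normalize_word(word: str) -> str:
    """Lower-case a word, drop surrounding punctuation and a common suffix."""
    word = word.strip(string.punctuation).lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def normalize(text: str) -> str:
    """Normalize every whitespace-separated word of a replacement text."""
    return " ".join(normalize_word(w) for w in text.split())
```

Two replacements whose normalized forms are equal, for example "Fixing typos" and "fixes typo,", can then be put in the same group even though the raw diffs differ. Since the normalization is lossy, it should only be used for grouping; always keep and display the original text of each patch.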

Problem 1 is the same problem posed by integrating or merging patches. Problem 2 is more specific to your particular case.

Upvotes: 1
