NullVoxPopuli
NullVoxPopuli

Reputation: 65173

What is the best (word or character)-based diff algorithm out there?

So, I want to be able to find the diff between two strings on a per-word basis (maybe faster than per-character, though, if per-character is faster then I'd want to do it that way).

Here is an example of what I want to achieve: Source Text:

Hello there!

Modified Text:

Helay scere?

diff:

Hel[lo](ay) [th](sc)ere[!](?)

there is kind of a super hackish way to do this using a commandline tool, such as opendiff, but it requires a newline character inbetween every character, as opendiff is line-based.

I'm using ruby, and haven't found any tools to do this... but language isn't terribly important, as algorithms can be ported pretty easily.

thanks.

Upvotes: 9

Views: 2022

Answers (5)

alex
alex

Reputation: 357

Have a look to https://github.com/pvande/differ. This gem does what you are looking for

Upvotes: 3

NullVoxPopuli
NullVoxPopuli

Reputation: 65173

Here is a ruby gem that does diffing of strings: http://rubydoc.info/gems/diff-lcs/1.1.3/frames

Before hand, I just did (in irb)

require 'rubygems'
require 'diff/lcs'
require 'diff/lcs/array'
require 'diff/lcs/string'

enter image description here

So, writing the logic to insert, inline removed and inserted markers becomes trivial thanks to this 2D diff array of changes.

Though I'm not sure if this is this the best way.

Upvotes: 2

Noxville
Noxville

Reputation: 544

So what you can do repeatedly use the LCS (as linked above) to find all the common strings, and remove them from both your strings, replacing them with some other string - let's just say a "*". Then you iterate through both strings at the same time, and re-mesh the common and distinct back together.

Example

A) Hello there!
B) Helay scere?

LCS detection gives us ["Hel"," ","ere"], and after replacement we have
A) *lo*th*!
B) *ay*sc*?

Now you split on the delimiter ("*") giving you
A) ["lo","th","!"]
B) ["ay","sc","?"]

And from here you've just go to do simple meshing. Something key to note is that there may be null entries, for example if you are doing this method on "Hell" and "Hel" you'll eventually get

Common LCS) ["Hel"]
A) ["l"]
B) [""]

meaning your result will be Hel[l]() 

Hopefully that is acceptable.

Upvotes: 2

Bhavana C
Bhavana C

Reputation: 69

A solution will be to find the edit distance between the strings.

Upvotes: 0

Victor Moroz
Victor Moroz

Reputation: 9225

You may want to check this: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem. It's not hard to implement.

Upvotes: 3

Related Questions