Reputation: 65173
So, I want to be able to find the diff between two strings on a per-word basis (maybe faster than per-character, though, if per-character is faster then I'd want to do it that way).
Here is an example of what I want to achieve: Source Text:
Hello there!
Modified Text:
Helay scere?
diff:
Hel[lo](ay) [th](sc)ere[!](?)
there is kind of a super hackish way to do this using a commandline tool, such as opendiff, but it requires a newline character inbetween every character, as opendiff is line-based.
I'm using ruby, and haven't found any tools to do this... but language isn't terribly important, as algorithms can be ported pretty easily.
thanks.
Upvotes: 9
Views: 2022
Reputation: 357
Have a look to https://github.com/pvande/differ. This gem does what you are looking for
Upvotes: 3
Reputation: 65173
Here is a ruby gem that does diffing of strings: http://rubydoc.info/gems/diff-lcs/1.1.3/frames
Before hand, I just did (in irb)
require 'rubygems'
require 'diff/lcs'
require 'diff/lcs/array'
require 'diff/lcs/string'
So, writing the logic to insert, inline removed and inserted markers becomes trivial thanks to this 2D diff array of changes.
Though I'm not sure if this is this the best way.
Upvotes: 2
Reputation: 544
So what you can do repeatedly use the LCS (as linked above) to find all the common strings, and remove them from both your strings, replacing them with some other string - let's just say a "*". Then you iterate through both strings at the same time, and re-mesh the common and distinct back together.
Example
A) Hello there!
B) Helay scere?
LCS detection gives us ["Hel"," ","ere"], and after replacement we have
A) *lo*th*!
B) *ay*sc*?
Now you split on the delimiter ("*") giving you
A) ["lo","th","!"]
B) ["ay","sc","?"]
And from here you've just go to do simple meshing. Something key to note is that there may be null entries, for example if you are doing this method on "Hell" and "Hel" you'll eventually get
Common LCS) ["Hel"]
A) ["l"]
B) [""]
meaning your result will be Hel[l]()
Hopefully that is acceptable.
Upvotes: 2
Reputation: 9225
You may want to check this: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem. It's not hard to implement.
Upvotes: 3