Brad Herman

Reputation: 10635

Diffing text with html tags

I have been wanting to either find or write a diffing gem that highlights not only changes in text but also changes in HTML structure. Here's a quick example of what I mean.

Right now, most diffing gems and algorithms out there will take something like:

a = "<p>I am some text</p>"
b = "<p>I was some text</p>"
MyDiffer.diff(a,b)
=> "<p>I <del>am</del><ins>was</ins> some text</p>"

However, when HTML tags are added or removed, most of them don't account for the change properly. I'd like to see something like this:

a = "<p>I am <strong>some</strong> text</p>"
b = "<p>I was some text</p>"
MyDiffer.diff(a,b)
=> "<p>I <del>am</del><ins>was</ins> <del class='htmlchange'><strong>some</strong></del><ins class='htmlchange'>some</ins> text</p>"

a = "<p>I am a sentence.  I am another sentence.</p>"
b = "<p>I am a sentence.</p><p>I am another sentence.</p>"
MyDiffer.diff(a,b)
=> "<p>I am a sentence.<del class='htmlchange'>I am another sentence.</del></p><ins class='htmlchange'><p>I am another sentence.</p></ins>"

Does something like this exist out there? If not, I'm not entirely sure how to go about building it. Any help would be appreciated.

Upvotes: 0

Views: 119

Answers (1)

the Tin Man

Reputation: 160581

For HTML you'll want to use a parser, such as Nokogiri, which will do some cleanup and normalizing for you. Then you'll want to rewrite the document's tags so that each tag's attributes (parameters) appear in a consistent order. I'd recommend a simple alphabetic sort by attribute name.

Nokogiri's to_html method will be useful when outputting the results of your restructuring.

You'll also need to decide whether whitespace in text nodes is retained or removed, and whether the case of attribute names and tag names is honored.

You could try doing it without relying on a parser, but I think you'd go nuts. Real-world HTML is too unstructured and irregular for a text-only approach to do more than a simple diff.
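
To show what a "simple diff" without a parser might look like, here is a sketch in plain Ruby with no gems. The names tokenize, lcs_table, and simple_diff are hypothetical, and the longest-common-subsequence approach is just one common way to diff token lists, not necessarily what any existing gem does. Each tag is treated as a single opaque token.

```ruby
# Split HTML into tags, words, and whitespace runs; each tag is one token.
def tokenize(html)
  html.scan(/<[^>]+>|[^<\s]+|\s+/)
end

# table[i][j] holds the length of the longest common subsequence
# of a[i..] and b[j..].
def lcs_table(a, b)
  table = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  (a.size - 1).downto(0) do |i|
    (b.size - 1).downto(0) do |j|
      table[i][j] =
        if a[i] == b[j]
          table[i + 1][j + 1] + 1
        else
          [table[i + 1][j], table[i][j + 1]].max
        end
    end
  end
  table
end

# Walk both token lists, wrapping removed tokens in <del> and added ones in <ins>.
def simple_diff(old_html, new_html)
  a = tokenize(old_html)
  b = tokenize(new_html)
  table = lcs_table(a, b)
  out = +""
  i = j = 0
  while i < a.size || j < b.size
    if i < a.size && j < b.size && a[i] == b[j]
      out << a[i]
      i += 1
      j += 1
    elsif i < a.size && (j == b.size || table[i + 1][j] >= table[i][j + 1])
      out << "<del>#{a[i]}</del>"
      i += 1
    else
      out << "<ins>#{b[j]}</ins>"
      j += 1
    end
  end
  out
end

puts simple_diff("<p>I am some text</p>", "<p>I was some text</p>")
```

This handles the plain-text case from the question, but when a tag pair such as <strong> is added or removed, the opening and closing tags end up wrapped separately, so the output is no longer properly nested. That is the limitation a parser-based approach is meant to avoid.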

Upvotes: 1
