Reputation: 946
Let's assume that the following URLs point to the same content.
How can I check whether those links point to the same content? I am using Ruby in particular, but any other suggestion is welcome as well.
Upvotes: 0
Views: 508
Reputation: 31077
The naive first approach is to fetch the content from each URL and compare hashes. However, if the content is dynamic in any way, the hashes will differ even when the pages are effectively the same, so this is not a reliable metric.
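require 'open-uri'
require 'digest/md5'
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
f1 = URI.open("http://rubyonrails.org/?id=1")
c1 = f1.read
d1 = Digest::MD5.hexdigest(c1)
f2 = URI.open("http://rubyonrails.org/")
c2 = f2.read
d2 = Digest::MD5.hexdigest(c2)
d1 == d2 # true when both responses are byte-for-byte identical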
If we repeat the same thing with, say, www.google.com and google.com, the hashes won't match because the content may vary slightly between responses.
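To see why a single varying byte is enough to break the hash comparison, here is a minimal offline sketch. The two page bodies and the `same_digest?` helper are made up for illustration; they are not taken from any real site.

```ruby
require 'digest/md5'

# Hypothetical page bodies: identical except for a server-generated timestamp.
page_a = "<html><body>Hello <!-- rendered 12:00:01 --></body></html>"
page_b = "<html><body>Hello <!-- rendered 12:00:02 --></body></html>"

# Returns true only when both bodies are byte-for-byte identical.
def same_digest?(a, b)
  Digest::MD5.hexdigest(a) == Digest::MD5.hexdigest(b)
end

puts same_digest?(page_a, page_a) # identical bodies
puts same_digest?(page_a, page_b) # bodies differ by one character
```

The second comparison fails even though a human would call the two pages the same, which is why a similarity score is more useful than an exact digest here.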
You can use the Jaro-Winkler measure for strings instead, which gives you a value between 0 and 1 for how similar two strings are. There is a pure-Ruby implementation of the algorithm as well as native ones; the native implementations are much faster. I've used the amatch library in the past.
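require 'open-uri'
require 'fuzzystringmatch'
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
f1 = URI.open("http://www.google.com/")
c1 = f1.read
f2 = URI.open("http://google.com/")
c2 = f2.read
delta = 0.1 # tolerance: how far below 1.0 the score may fall
jarow = FuzzyStringMatch::JaroWinkler.create( :pure ) # :native selects the faster C implementation
# Despite its name, getDistance returns a similarity score (1.0 means identical)
distance = jarow.getDistance(c1, c2) # e.g. 0.85, i.e. the text looks about 85% similar
same_content = (1.0 - distance) <= delta # assumed intent of delta: accept scores within delta of 1.0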
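For anyone curious what the score actually measures, here is a rough pure-Ruby sketch of the textbook Jaro-Winkler computation. The method names `jaro` and `jaro_winkler` and the default `scaling` of 0.1 are my own choices for illustration; the gems mentioned above may differ in details, and this version is far too slow for whole page bodies.

```ruby
# Jaro similarity: based on matching characters within a sliding window
# and the number of transpositions among those matches.
def jaro(s1, s2)
  return 1.0 if s1 == s2
  return 0.0 if s1.empty? || s2.empty?

  a = s1.chars
  b = s2.chars
  window = [[a.length, b.length].max / 2 - 1, 0].max
  a_matched = Array.new(a.length, false)
  b_matched = Array.new(b.length, false)
  matches = 0

  a.each_with_index do |ch, i|
    lo = [i - window, 0].max
    hi = [i + window, b.length - 1].min
    (lo..hi).each do |j|
      next if b_matched[j] || b[j] != ch
      a_matched[i] = b_matched[j] = true
      matches += 1
      break
    end
  end
  return 0.0 if matches.zero?

  # Matched characters in order of appearance in each string.
  a_seq = a.each_index.select { |i| a_matched[i] }.map { |i| a[i] }
  b_seq = b.each_index.select { |j| b_matched[j] }.map { |j| b[j] }
  transpositions = a_seq.zip(b_seq).count { |x, y| x != y } / 2

  m = matches.to_f
  (m / a.length + m / b.length + (m - transpositions) / m) / 3.0
end

# Winkler adjustment: boost the score for a shared prefix of up to 4 characters.
def jaro_winkler(s1, s2, scaling = 0.1)
  score = jaro(s1, s2)
  prefix = 0
  s1.chars.zip(s2.chars).first(4).each do |x, y|
    break unless x == y
    prefix += 1
  end
  score + prefix * scaling * (1.0 - score)
end

puts jaro_winkler("MARTHA", "MARHTA").round(4)
```

The classic pair "MARTHA" / "MARHTA" comes out at roughly 0.961 with this sketch: all six characters match, one pair is transposed, and the shared prefix "MAR" lifts the plain Jaro score slightly.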
Upvotes: 2