Mesut

Reputation: 946

How to check if two URLs return the same page

Let's assume that the following URLs point to the same content.

How can I check whether those links point to the same content? I am using Ruby in particular, but any other suggestions are welcome as well.

Upvotes: 0

Views: 508

Answers (1)

Roland Mai

Reputation: 31077

The first, naive approach is to fetch the content of each URL and compare hashes of it. However, if the page has any dynamic content at all, this is not a good metric, because a single changed byte produces a different hash.

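require 'open-uri'
require 'digest/md5'

# Use URI.open: on Ruby 3.0+, Kernel#open no longer accepts URLs
f1 = URI.open("http://rubyonrails.org/?id=1")
c1 = f1.read
d1 = Digest::MD5.hexdigest(c1)

f2 = URI.open("http://rubyonrails.org/")
c2 = f2.read
d2 = Digest::MD5.hexdigest(c2)

d1 == d2 # true when both responses are byte-for-byte identical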

If we repeat the same thing with, say, www.google.com and google.com, the hashes won't match, because the content may vary slightly between responses.
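
To illustrate why an exact hash is all-or-nothing, here is a minimal sketch. The two HTML strings are made up for the example (they are not real responses) and differ in a single character:

```ruby
require 'digest/md5'

# Hypothetical page bodies that differ only in an embedded timestamp
page_a = "<html><body>Hello <!-- served 12:00:01 --></body></html>"
page_b = "<html><body>Hello <!-- served 12:00:02 --></body></html>"

digest_a = Digest::MD5.hexdigest(page_a)
digest_b = Digest::MD5.hexdigest(page_b)

# A one-character difference is enough to make the digests differ
digests_match = (digest_a == digest_b)
puts digests_match
```

So a digest comparison can confirm that two responses are identical, but it says nothing about how close two non-identical responses are.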

You can use the Jaro-Winkler measure for strings, which gives you a value between 0 and 1 for how similar two strings are. There is a pure-Ruby implementation of the algorithm as well, but the native implementations are much faster. I've used the amatch library in the past.

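require 'open-uri'
require 'fuzzystringmatch'

# Use URI.open: on Ruby 3.0+, Kernel#open no longer accepts URLs
f1 = URI.open("http://www.google.com/")
c1 = f1.read

f2 = URI.open("http://google.com/")
c2 = f2.read

delta = 0.1 # tolerance; not used below, e.g. treat pages as the same when distance >= 1 - delta
jarow = FuzzyStringMatch::JaroWinkler.create( :pure )
distance = jarow.getDistance(c1, c2) # e.g. 0.85, meaning the texts look about 85% similar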
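
For reference, here is a rough pure-Ruby sketch of how the Jaro-Winkler measure is computed. It is not the code of either gem mentioned above. The method names `jaro`, `jaro_winkler` and `similar_pages?` are made up for this example, as is the way `delta` is used as a threshold. It uses the commonly cited prefix scaling factor of 0.1 and a maximum prefix length of 4.

```ruby
# Jaro similarity: share of matching characters, penalised for transpositions.
def jaro(s1, s2)
  return 1.0 if s1 == s2
  return 0.0 if s1.empty? || s2.empty?

  a = s1.chars
  b = s2.chars
  # Characters only count as matching if they are at most `window` positions apart
  window = [[a.size, b.size].max / 2 - 1, 0].max
  a_matched = Array.new(a.size, false)
  b_matched = Array.new(b.size, false)
  matches = 0

  a.each_with_index do |ch, i|
    lo = [i - window, 0].max
    hi = [i + window, b.size - 1].min
    (lo..hi).each do |j|
      next if b_matched[j] || b[j] != ch
      a_matched[i] = b_matched[j] = true
      matches += 1
      break
    end
  end
  return 0.0 if matches.zero?

  # Matched characters in order of appearance; mismatched pairs are half-transpositions
  a_seq = a.each_index.select { |i| a_matched[i] }.map { |i| a[i] }
  b_seq = b.each_index.select { |j| b_matched[j] }.map { |j| b[j] }
  transpositions = a_seq.zip(b_seq).count { |x, y| x != y } / 2

  m = matches.to_f
  (m / a.size + m / b.size + (m - transpositions) / m) / 3.0
end

# Winkler adjustment: boost the score for a shared prefix of up to 4 characters.
def jaro_winkler(s1, s2, scaling: 0.1)
  j = jaro(s1, s2)
  prefix = s1[0, 4].chars.zip(s2[0, 4].chars).take_while { |x, y| x == y }.size
  j + prefix * scaling * (1.0 - j)
end

# Hypothetical helper: treat two bodies as "the same page" within a tolerance.
def similar_pages?(body_a, body_b, delta: 0.1)
  jaro_winkler(body_a, body_b) >= 1.0 - delta
end

puts jaro_winkler("MARTHA", "MARHTA") # about 0.961
```

This version is quadratic in the worst case, so for full HTML pages it is likely to be slow, which is one reason to prefer a native implementation.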

Upvotes: 2
