Joshua Enfield

Reputation: 18278

Compare two websites and see if they are "equal?"

We are migrating web servers, and it would be nice to have an automated way to check some of the basic site structure to see if the rendered pages are the same on the new server as the old server. I was just wondering if anyone knew of anything to assist in this task?

Upvotes: 3

Views: 3632

Answers (6)

roesslerj

Reputation: 2661

Using the open source tool recheck-web (https://github.com/retest/recheck-web), there are two possibilities:

For both solutions you currently need to manually list all relevant URLs. In most situations, this shouldn't be a big problem. recheck-web will compare the rendered website and show you exactly where they differ (i.e. different font, different meta tags, even different link URLs). And it gives you powerful filters to let you focus on what is relevant to you.

Disclaimer: I have helped create recheck-web.

Upvotes: 1

Pedro

Reputation: 1021

I've written some PHP code that does what Weboide suggests in their answer. Thanks, Weboide!

The paste is here:

http://pastebin.com/0V7sVNEq

Upvotes: 1

Warner

Reputation: 101

Copy the files to the same server in /tmp/directory1 and /tmp/directory2 and run the following command:

diff -r /tmp/directory1 /tmp/directory2

The directory names and locations are arbitrary; use whatever location and naming convention you prefer.

Edit 1

You could also fetch the pages with lynx -dump or wget and run diff on the results.

Upvotes: 0

coredump

Reputation:

The catch is how to check the 'rendered' pages. If the pages don't have any dynamic content, the easiest way is to generate hashes for the files on the old server with an md5 or sha1 command and check them against the files on the new server.
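As a rough sketch of the hash approach, assuming both document roots have been copied to local directories and that GNU coreutils' sha1sum is available: the function names hash_manifest and compare_trees are made up for this example.

```shell
#!/bin/sh
# hash_manifest DIR: print "<sha1>  ./relative/path" for every file under DIR,
# sorted by path so two manifests can be compared line by line.
hash_manifest() {
    ( cd "$1" && find . -type f -exec sha1sum {} + | LC_ALL=C sort -k 2 )
}

# compare_trees OLD_DIR NEW_DIR: diff the two manifests.
# Returns 0 when every file hashes the same, 1 when something differs.
compare_trees() {
    old_list=$(mktemp) && new_list=$(mktemp) || return 2
    hash_manifest "$1" > "$old_list"
    hash_manifest "$2" > "$new_list"
    diff "$old_list" "$new_list"
    status=$?
    rm -f "$old_list" "$new_list"
    return "$status"
}
```

Calling compare_trees /tmp/directory1 /tmp/directory2 prints the manifest lines that differ, so you learn which files changed but not what changed inside them.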

If the pages have dynamic content, you will have to download the site using a tool like wget

wget --mirror http://thewebsite/thepages

and then use diff as suggested by Warner, or hash the downloaded files again. I think diff may be the best way to go, since a change of even one character alters the hash, and the hash alone won't tell you what changed.

Upvotes: 2

Jeff McJunkin

Reputation: 128

Short of rendering each page, taking screen captures, and comparing those screenshots, I don't think it's possible to compare the rendered pages.

However, it is certainly possible to compare the downloaded website after downloading recursively with wget.

  wget [option]... [URL]...

   -m
   --mirror
       Turn on options suitable for mirroring.  This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP
       directory listings.  It is currently equivalent to -r -N -l inf --no-remove-listing.

The next step would then be to do the recursive diff that Warner recommended.
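Put together, the mirror-then-diff workflow might look like the sketch below. The function names mirror_site and diff_mirrors are invented for this example, and the wget options beyond --mirror are my own choice: --no-host-directories keeps wget from creating a per-host directory, so both mirrors end up with the same layout.

```shell
#!/bin/sh
# mirror_site URL DEST: recursively download URL into DEST.
mirror_site() {
    wget --mirror --no-host-directories --quiet --directory-prefix="$2" "$1"
}

# diff_mirrors OLD_DIR NEW_DIR: list files that differ or exist on one side only.
# Returns 0 when the trees match, 1 when they do not.
diff_mirrors() {
    diff -r -q "$1" "$2"
}
```

You would run mirror_site once against each server, with placeholder arguments such as http://old.example.com/ /tmp/old and http://new.example.com/ /tmp/new, and then diff_mirrors /tmp/old /tmp/new. Pages that embed absolute links with the server's hostname will show up as differences, which may or may not be what you want.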

Upvotes: 0

Weboide

Reputation: 1110

Get the formatted output of both sites (here we use w3m, but lynx can also work):

w3m -dump http://google.com 2>/dev/null > /tmp/1.html
w3m -dump http://google.de 2>/dev/null > /tmp/2.html

Then use wdiff, which can give you a percentage of how similar the two texts are.

wdiff -nis /tmp/1.html /tmp/2.html

The differences can also be easier to see with colordiff.

wdiff -nis /tmp/1.html /tmp/2.html | colordiff

Excerpt of output:

Web Images Vidéos Maps [-Actualités-] Livres {+Traduction+} Gmail plus »
[-iGoogle |-]
Paramètres | Connexion

                           Google [hp1] [hp2]
                                  [hp3] [-Français-] {+Deutschland+}

           [                                                         ] Recherche
                                                                       avancéeOutils
                      [Recherche Google][J'ai de la chance]            linguistiques


/tmp/1.html: 43 words  39 90% common  3 6% deleted  1 2% changed
/tmp/2.html: 49 words  39 79% common  9 18% inserted  1 2% changed

(It actually served google.com in French... funny.)

The "common" percentages show how similar the two texts are. You can also see the differences word by word rather than line by line, which can get cluttered.
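For a migration you would probably want to repeat this for many pages. Here is a minimal sketch, assuming w3m and wdiff are installed and that you keep a list of URL paths, one per line; the function name compare_pages and the base-URL arguments are invented for this example.

```shell
#!/bin/sh
# compare_pages OLD_BASE NEW_BASE: read URL paths from stdin, dump each page
# from both servers as text, and print the two summary lines that wdiff -s
# writes at the end of its output.
compare_pages() {
    old_base=$1
    new_base=$2
    work=$(mktemp -d) || return 2
    while IFS= read -r path; do
        [ -n "$path" ] || continue          # skip blank lines
        # </dev/null keeps w3m from consuming the path list on stdin
        w3m -dump "$old_base$path" < /dev/null 2> /dev/null > "$work/old.txt"
        w3m -dump "$new_base$path" < /dev/null 2> /dev/null > "$work/new.txt"
        printf '== %s\n' "$path"
        wdiff -nis "$work/old.txt" "$work/new.txt" | tail -n 2
    done
    rm -rf "$work"
}
```

With a file paths.txt containing lines such as /index.html, you could run compare_pages http://old.example.com http://new.example.com < paths.txt and skim the "common" percentages for pages that drop well below 100%.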

Upvotes: 5
