sr01853

Reputation: 6121

Detecting duplicate webpages among large number of URLs

From a quote on the official Google blog:

"In fact, we found even more than 1 trillion individual links, but not all of 
them lead to unique web pages. Many pages have multiple URLs with exactly the same
content or URLs that are auto-generated copies of each other. Even after removing
those exact duplicates . . . "

How does Google detect those exact duplicate web pages or documents? Any idea which algorithm Google uses?

Upvotes: 4

Views: 1916

Answers (2)

User

Reputation: 66081

According to http://en.wikipedia.org/wiki/MinHash:

A large scale evaluation has been conducted by Google in 2006 [10] to compare the performance of Minhash and Simhash[11] algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling[12] and using Minhash and LSH for Google News personalization.[13]
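To give a rough idea of the MinHash side of that comparison, here is a minimal sketch (my own illustration, not Google's code): shingle each document into word n-grams, hash every shingle with a family of hash functions, keep the minimum per function, and estimate the Jaccard similarity of two documents as the fraction of positions where their signatures agree. Salting MD5 with a seed to imitate independent hash functions and using 128 of them are arbitrary simplifications here.

    import hashlib

    def shingles(text, n=3):
        """Split a document into overlapping word n-grams ("shingles")."""
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def minhash_signature(shingle_set, num_hashes=128):
        """Keep the minimum hash value per hash function.
        A seed-salted MD5 stands in for a family of independent hash functions."""
        return [
            min(int(hashlib.md5(f"{seed}:{s}".encode("utf-8")).hexdigest(), 16)
                for s in shingle_set)
            for seed in range(num_hashes)
        ]

    def estimated_jaccard(sig_a, sig_b):
        """Fraction of agreeing positions approximates the true Jaccard similarity."""
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    a = shingles("many pages have multiple urls with exactly the same content")
    b = shingles("many pages have several urls with exactly the same content")
    print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))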

A search for Simhash turns up this page:

https://liangsun.org/posts/a-python-implementation-of-simhash-algorithm/

https://github.com/leonsim/simhash

which references a paper written by Google employees: "Detecting Near-Duplicates for Web Crawling".

Abstract:

Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and all batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.

Another Simhash paper:

http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf
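For concreteness, here is a minimal SimHash sketch in Python (my own illustration under simplifying assumptions, not the code from the links above): each token is hashed to an f-bit value, the bits vote on the final fingerprint, and two pages are flagged as near-duplicates when their fingerprints differ in at most k bit positions, which is exactly the Hamming-distance question the paper's second contribution speeds up. MD5 as the token hash and k = 3 are arbitrary choices here.

    import hashlib
    import re

    def simhash(text, f=64):
        """Compute an f-bit SimHash fingerprint: each token's hash bits vote
        +1/-1 per position, and the sign of each total sets that fingerprint bit."""
        v = [0] * f
        for token in re.findall(r"\w+", text.lower()):
            h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << f) - 1)
            for i in range(f):
                v[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(f) if v[i] > 0)

    def hamming_distance(a, b):
        """Number of bit positions in which two fingerprints differ."""
        return bin(a ^ b).count("1")

    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox jumped over the lazy dog"
    k = 3  # near-duplicate threshold in bits
    print(hamming_distance(simhash(doc1), simhash(doc2)) <= k)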

Upvotes: 1

Zaki

Reputation: 158

Possible solutions:

Exact methods:

1) Brute force: compare every new page against all previously visited pages (very slow and inefficient).

2) Compute a hash of every visited page (MD5, SHA-1), store the hashes in a database, and look up every new page's hash there (a sketch of this follows the list).

3) Standard Boolean model of information retrieval (BIR).

... many other possible methods
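A minimal sketch of option 2, assuming an in-memory SQLite table as "the database" and SHA-1 as the content hash (both arbitrary choices for illustration):

    import hashlib
    import sqlite3

    # In-memory SQLite stands in for the database; any store with a
    # unique index on the hash column would serve the same purpose.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE pages (content_hash TEXT PRIMARY KEY, url TEXT)")

    def page_hash(html):
        """SHA-1 of the page content: identical content gives an identical hash."""
        return hashlib.sha1(html.encode("utf-8")).hexdigest()

    def is_duplicate(url, html):
        """True if exactly this content was seen before; otherwise record it."""
        h = page_hash(html)
        if db.execute("SELECT 1 FROM pages WHERE content_hash = ?", (h,)).fetchone():
            return True
        db.execute("INSERT INTO pages (content_hash, url) VALUES (?, ?)", (h, url))
        return False

    print(is_duplicate("http://example.com/a", "<html>same body</html>"))  # False
    print(is_duplicate("http://example.com/b", "<html>same body</html>"))  # True

Note this only catches byte-for-byte duplicates; any trivial difference (a timestamp, an ad) changes the hash, which is why the near-exact methods below exist.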

Near-exact methods:

1) Fuzzy hashing.

2) Latent semantic indexing.

...

Upvotes: 0
