Alex

Reputation: 1486

Generating Unique Hash for URL crawled by crawler

I am implementing a crawler and I want to generate a unique hash code for every URL it crawls. This will help me check for duplicate URLs, since matching complete URLs can be expensive. The crawler will crawl millions of pages daily, so the output of this hash function should be unique.

Upvotes: 1

Views: 587

Answers (2)

Adrian B

Reputation: 1631

I think the solution is to normalize URLs first, by removing prefixes like http:// or http://www. from the beginning and trailing parts like /, ?..., or #... from the end. After this cleaning, you should have a clean domain URL, and you can compute a hash for it.
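A minimal sketch of that idea in Java; the exact normalization rules here are assumptions, so adjust them to whatever canonical form your crawler needs:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class UrlNormalizer {

        // Strip the scheme and a leading "www.", then drop everything after
        // the first "/", "?" or "#" so only the bare domain remains.
        static String normalize(String url) {
            String s = url.toLowerCase();
            s = s.replaceFirst("^https?://", "");
            s = s.replaceFirst("^www\\.", "");
            int cut = s.length();
            for (char c : new char[] {'/', '?', '#'}) {
                int i = s.indexOf(c);
                if (i >= 0 && i < cut) cut = i;
            }
            return s.substring(0, cut);
        }

        // Hash the normalized form, e.g. with SHA-256.
        static String hash(String url) throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(normalize(url).getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            // Both of these normalize to "stackoverflow.com" and hash identically.
            System.out.println(hash("http://www.stackoverflow.com/"));
            System.out.println(hash("https://stackoverflow.com?tab=newest"));
        }
    }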

But the best solution is to use a Bloom filter (a probabilistic data structure), which can tell you whether a URL was probably visited or definitely not visited.
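If Guava is on your classpath, its BloomFilter is one way to sketch this; the expected-insertion count and false-positive rate below are assumed figures you would tune:

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public class SeenUrls {
        public static void main(String[] args) {
            // Sized for ~10 million URLs with a ~1% false-positive rate (assumed figures).
            BloomFilter<CharSequence> seen = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

            String url = "http://stackoverflow.com/questions";
            if (!seen.mightContain(url)) {
                // Definitely not visited yet: crawl it, then record it.
                seen.put(url);
            }
            // mightContain(url) == true can be a false positive,
            // but a false "not visited" answer can never happen.
        }
    }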

Upvotes: 1

Steve P.

Reputation: 14699

Unless you know every address ahead of time, and there happens to be a perfect hash for said set of addresses, this task is theoretically impossible.

By the pigeonhole principle, there must exist at least two strings that have the same Integer value no matter what technique you use for conversion, considering that Integers have a finite range, and strings do not. While addresses, in reality, are not infinitely long, you're still going to get multiple addresses that map to the same hash value. In theory, there are infinitely many strings that will map to the same Integer value.

So, in conclusion, you should probably just use a standard HashMap.

Additionally, you need to worry about the following:

www.stackoverflow.com
http://www.stackoverflow.com
http://stackoverflow.com
stackoverflow.com
...

which are all equivalent, so you would need to normalize first and then hash; a rough illustration follows below. While there are algorithms that, given the full key set up front, will generate a perfect hash, I doubt that is necessary for your purposes.
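As a rough illustration of that point, assuming a simple normalization like the one above and Java's standard HashSet as the seen-set:

    import java.util.HashSet;
    import java.util.Set;

    public class DedupDemo {
        // Collapse the equivalent spellings to one canonical form before hashing/storing.
        static String normalize(String url) {
            return url.toLowerCase()
                      .replaceFirst("^https?://", "")
                      .replaceFirst("^www\\.", "")
                      .replaceFirst("[/?#].*$", "");
        }

        public static void main(String[] args) {
            Set<String> seen = new HashSet<>();
            String[] urls = {
                "www.stackoverflow.com",
                "http://www.stackoverflow.com",
                "http://stackoverflow.com",
                "stackoverflow.com"
            };
            for (String u : urls) {
                seen.add(normalize(u));
            }
            // Prints 1: all four spellings collapse to "stackoverflow.com".
            System.out.println(seen.size());
        }
    }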

Upvotes: 1
