Reputation: 1486
I am implementing a crawler and I want to generate a unique hash code for every URL crawled by my system. This will help me check for duplicate URLs, since matching complete URLs can be expensive. The crawler will crawl millions of pages daily, so the output of this hash function should be unique.
Upvotes: 1
Views: 587
Reputation: 1631
I think the solution is to normalize URLs first: remove prefixes like http:// or http://www. from the beginning, and trailing parts like /, ?... or #... from the end. After this cleaning you should have a clean, canonical URL, and you can hash that.
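For example, a minimal normalization sketch in Java could look like the following (the exact rules here are an assumption; a real crawler usually needs more, such as sorting query parameters or resolving relative paths):

    final class UrlNormalizer {

        /** Cleans a URL as described above: drop the scheme, a leading "www.",
         *  the query string, the fragment, and a trailing slash. */
        static String normalize(String url) {
            // Lowercasing the whole URL is a simplification; strictly only the
            // scheme and host are case-insensitive.
            String u = url.toLowerCase();
            u = u.replaceFirst("^https?://", "");   // remove http:// or https://
            u = u.replaceFirst("^www\\.", "");      // remove a leading www.
            u = u.replaceFirst("[?#].*$", "");      // remove ?... and #...
            if (u.endsWith("/")) {
                u = u.substring(0, u.length() - 1); // remove a trailing /
            }
            return u;
        }
    }

With this, "http://www.example.com/page?x=1" and "example.com/page" produce the same key, so duplicates can be detected by comparing the normalized strings or their hashes.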
But the best solution is to use a Bloom filter (a probabilistic data structure), which can tell you whether a URL was probably visited or definitely not visited.
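As a sketch of the Bloom filter approach, assuming Guava is on the classpath (its BloomFilter is one common off-the-shelf implementation):

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;
    import java.nio.charset.StandardCharsets;

    public class CrawlDeduper {
        // Sized for ~10 million URLs with a ~0.1% false-positive rate; tune for your crawl volume.
        private final BloomFilter<String> probablySeen =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.001);

        /** Returns true if the URL has definitely not been seen before, and records it. */
        public boolean shouldCrawl(String normalizedUrl) {
            if (probablySeen.mightContain(normalizedUrl)) {
                return false; // probably a duplicate (small chance of a false positive)
            }
            probablySeen.put(normalizedUrl);
            return true;
        }
    }

The trade-off is that a false positive occasionally skips a genuinely new URL, but an already-visited URL is never reported as unseen, and memory stays bounded even at millions of URLs per day.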
Upvotes: 1
Reputation: 14699
Unless you know every address ahead of time, and there happens to be a perfect hash for said set of addresses, this task is theoretically impossible.
By the pigeonhole principle, there must exist at least two strings that map to the same Integer value no matter what technique you use for conversion, because Integers have a finite range and strings do not. While addresses in reality are not infinitely long, you are still going to get multiple addresses that map to the same hash value; in theory, there are infinitely many strings that map to the same Integer value.
So, in conclusion, you should probably just use a standard HashMap.
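If the set of seen URLs fits in memory, a sketch of that exact-matching approach (using a HashSet, which is simply a HashMap of keys) could be:

    import java.util.HashSet;
    import java.util.Set;

    public class SeenUrls {
        private final Set<String> seen = new HashSet<>();

        /** Returns true the first time a URL is offered; false for every later duplicate. */
        public boolean markIfNew(String normalizedUrl) {
            return seen.add(normalizedUrl); // Set.add is false if the element was already present
        }
    }

Hash collisions are harmless here: the set still compares the full strings on collision, so duplicates are detected exactly rather than probabilistically.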
Additionally, you need to worry about the following:
www.stackoverflow.com
http://www.stackoverflow.com
http://stackoverflow.com
stackoverflow.com
...
which are all equivalent, so you would need to normalize first, then hash. While there are algorithms that, given the full set of keys up front, will generate a perfect hash, I doubt that is necessary for your purposes.
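For instance, a quick check that the variants above collapse to a single key after normalization (normalize() here is a hypothetical helper doing the cleanup just described):

    import java.util.HashSet;
    import java.util.Set;

    public class NormalizeThenHash {

        // Hypothetical normalizer: strips the scheme, a leading "www.",
        // the query string, the fragment, and a trailing slash.
        static String normalize(String url) {
            return url.toLowerCase()
                      .replaceFirst("^https?://", "")
                      .replaceFirst("^www\\.", "")
                      .replaceFirst("[?#].*$", "")
                      .replaceFirst("/$", "");
        }

        public static void main(String[] args) {
            String[] variants = {
                "www.stackoverflow.com",
                "http://www.stackoverflow.com",
                "http://stackoverflow.com",
                "stackoverflow.com"
            };
            Set<String> keys = new HashSet<>();
            for (String v : variants) {
                keys.add(normalize(v));
            }
            System.out.println(keys); // prints [stackoverflow.com]: one key, so one hash
        }
    }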
Upvotes: 1