JoeFrizz
JoeFrizz

Reputation: 711

URL representation

I wonder how to efficiently store a website URL in a database (mongoDB in my case)...

The problem: It should be indexed to achieve fast query performance but mongo allows indexes on fields smaller than 1024 bytes "only".

I thought about hashing or base64 to shrink the URL... but since I use a single threaded webserver (node.js) I don't want to do heavy stuff on it...

Are there any good ideas about other ways to achieve this (the alternative representation should be unique...)?

Upvotes: 1

Views: 361

Answers (1)

jmikola
jmikola

Reputation: 6922

This very question comes up during 10gen's MongoDB training and hashing is presented as the viable solution. Generating an MD5 hash for a URL shouldn't be computationally intensive. I definitely wouldn't suggest base64-encoding, as that's only going to expand the URL string.

The goal is to create an index with high cardinality, but that doesn't mean the hashes have to be unique. If you include both the hash and URL in your query, you'll take advantage of the highly-selective hash index and then MongoDB will match the URL among the index hits. In the following example, let's pretend there is a hash collision for both URL's:

$ mongo --quiet
> db.urls.insert({_id: 1, url: "http://google.com", hash: "c7b920f"});
> db.urls.insert({_id: 2, url: "http://yahoo.com", hash: "c7b920f"});
> db.urls.find({hash: "c7b920f"})
{ "_id" : 1, "url" : "http://google.com", "hash" : "c7b920f" }
{ "_id" : 2, "url" : "http://yahoo.com", "hash" : "c7b920f" }

> db.urls.find({hash: "c7b920f", url: "http://google.com"})
{ "_id" : 1, "url" : "http://google.com", "hash" : "c7b920f" }

> db.urls.ensureIndex({hash: 1})
> db.urls.find({hash: "c7b920f", url: "http://google.com"}).explain()
{
    "cursor" : "BtreeCursor hash_1",
    "nscanned" : 2,
    "nscannedObjects" : 2,
    "n" : 1,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "hash" : [
            [
                "c7b920f",
                "c7b920f"
            ]
        ]
    },
    "server" : "localhost:27017"
}

I'm not sure if you have additional business requirements to guarantee URL uniqueness throughout the collection, but the example above is just to show that it isn't necessary from a querying standpoint. Of course, any hash algorithm is going to have some chance of collision, but you have better options than MD5 that would still satisfy the 1024-byte limit.

Upvotes: 4

Related Questions