josesigna

Reputation: 488

Data structure for storing partial URLs where search speed is the priority

We have a large database of partial urls (strings) such as:

Our software listens for HTTP requests and tries to find one of our database's partial URLs in each request's full URL.

So we are getting full URLs (e.g. http://www.example.com/blah.js?foo=bar) and trying to match one of our database's partial patterns against each one.

Which would be the best data structure to store our partial-URL database in, if all we care about is search speed?


Right now, this is what we do:

UPDATE:

This software is a Firefox extension written in JavaScript on Firefox's Add-on SDK.

Upvotes: 1

Views: 237

Answers (2)

stan0

Reputation: 11817

Assuming that your partial strings are only domain names and/or page names, you could generate all the possible combinations from the URL, starting from the end:

 http://www.example.com/blah.js?foo=bar
 blah.js
 example.com/blah.js
 www.example.com/blah.js

Then hash all the combinations and check whether any of them appears in the set of hashes of the partial strings from your database.
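A minimal sketch of this idea in JavaScript, using a Set (which hashes its keys internally) in place of the hashed arrays; the splitting rules here are an assumption and should be adjusted to whatever forms your partial strings can actually take:

    // Build the lookup structure once from the database's partial strings.
    // Set membership tests are hashed, so each lookup is O(1) on average.
    var partials = new Set([
      "blah.js",
      "example.com/blah.js",
      "www.example.com/blah.js"
    ]);

    // Generate the combinations of a full URL "starting from the end":
    // drop the scheme and query string, then peel host labels off the front.
    function candidateKeys(fullUrl) {
      var stripped = fullUrl.replace(/^[a-z]+:\/\//i, "").split(/[?#]/)[0];
      var slash = stripped.indexOf("/");
      var host = slash === -1 ? stripped : stripped.slice(0, slash);
      var path = slash === -1 ? "" : stripped.slice(slash);
      var labels = host.split(".");
      var keys = [];
      for (var i = 0; i < labels.length; i++) {
        keys.push(labels.slice(i).join(".") + path); // e.g. example.com/blah.js
      }
      var last = path.split("/").pop();
      if (last) keys.push(last);                     // e.g. blah.js
      return keys;
    }

    // Does any combination of this URL appear in the database?
    function isMatch(fullUrl) {
      return candidateKeys(fullUrl).some(function (key) {
        return partials.has(key);
      });
    }

    console.log(isMatch("http://www.example.com/blah.js?foo=bar")); // true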

NOTE:

In case you want to match ANY substring of the URL, like ample in example.com, it becomes a little more complicated in terms of storage, because a string of length n has n - k + 1 substrings of length k, where n is the length of the URL and k is the length of the string to find.

According to this SO question the maximum reasonable length of a URL is 2000 characters. If you want to match arbitrary substrings, k varies from 1 to 2000, and summing n - k + 1 over all k gives n * (n + 1) / 2 hashes generated from a single URL. More precisely, for n = 2000 that is 2000 * 2001 / 2 = 2,001,000 different hashes.
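As a quick sanity check of that count (a hypothetical helper, same JavaScript as above):

    // Contiguous substrings of a string of length n:
    // sum over k = 1..n of (n - k + 1) = n * (n + 1) / 2.
    function substringCount(n) {
      return n * (n + 1) / 2;
    }

    console.log(substringCount(2000)); // 2001000 hashes for one 2000-char URL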

Upvotes: 1

Zorayr

Reputation: 24962

There are a few things you could do:

  • Don't process the URLs on the client side. JavaScript is going to be slow, especially if you have a lot of these URLs. You can create a REST API and pass in the URL for matching as a query parameter, e.g. domain.com/api/?url=.... Placing the heavy lifting and memory use on the server side also keeps the extension lightweight.
  • Bootstrap the URLs into RAM and don't read from the database on every request. Something like memcached could work perfectly in this case.
  • Once in RAM, a hash-table structure works best since you are doing simple exact matching. Whatever you do, avoid comparing the URL against every string in the list (see the sketch below).
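A rough sketch of the last two points; the bootstrap function and its data source are placeholders for however you actually load from your database:

    // Hypothetical bootstrap: load the partial URLs from your store once
    // at startup, not on every request.
    var urlTable = null;

    function bootstrap(partialUrls) {
      // A plain object used as a hash table: property lookup is hashed,
      // so no linear scan or per-entry string comparison is needed.
      urlTable = Object.create(null);
      partialUrls.forEach(function (u) {
        urlTable[u] = true;
      });
    }

    function isKnown(partialUrl) {
      return urlTable !== null && partialUrl in urlTable;
    }

    bootstrap(["example.com/blah.js", "blah.js"]); // once, at startup
    console.log(isKnown("blah.js")); // true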

If you follow these suggestions you should see a significant speedup. Hope this helps.

Upvotes: 0
