josesigna

Reputation: 488

Data structure for storing partial URLs where search speed is the priority

We have a large database of partial urls (strings) such as:

Our software listens for HTTP requests and tries to find one of our database's partial URLs in each request's full URL.

So we are getting full URLs (e.g. http://www.example.com/blah.js?foo=bar) and trying to match one of our database's partial patterns against each one.

Which would be the best data structure to store our partial-URL database in, if all we care about is search speed?


Right now, this is what we do:

UPDATE:

This software is a Firefox extension written in JavaScript on Firefox's Add-on SDK.

Upvotes: 1

Views: 237

Answers (2)

stan0

Reputation: 11817

Assuming that your partial strings are only domain names and/or page names, you could generate all the possible combinations from the URL, starting from the end:

 http://www.example.com/blah.js?foo=bar
 blah.js
 example.com/blah.js
 www.example.com/blah.js

Then hash all the combinations and check whether any of them appears in the set of hashes of the partial strings from your database.
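A minimal sketch of this idea in JavaScript, using a Set (which hashes its keys internally) in place of the hashed arrays; the splitting rules here are an assumption and should be adjusted to whatever forms your partial strings can actually take:

    // Build the lookup structure once from the database's partial strings.
    // Set membership tests are hashed, so each lookup is O(1) on average.
    var partials = new Set([
      "blah.js",
      "example.com/blah.js",
      "www.example.com/blah.js"
    ]);

    // Generate the combinations of a full URL "starting from the end":
    // drop the scheme and query string, then peel host labels off the front.
    function candidateKeys(fullUrl) {
      var stripped = fullUrl.replace(/^[a-z]+:\/\//i, "").split(/[?#]/)[0];
      var slash = stripped.indexOf("/");
      var host = slash === -1 ? stripped : stripped.slice(0, slash);
      var path = slash === -1 ? "" : stripped.slice(slash);
      var labels = host.split(".");
      var keys = [];
      for (var i = 0; i < labels.length; i++) {
        keys.push(labels.slice(i).join(".") + path); // e.g. example.com/blah.js
      }
      var last = path.split("/").pop();
      if (last) keys.push(last);                     // e.g. blah.js
      return keys;
    }

    // Does any combination of this URL appear in the database?
    function isMatch(fullUrl) {
      return candidateKeys(fullUrl).some(function (key) {
        return partials.has(key);
      });
    }

    console.log(isMatch("http://www.example.com/blah.js?foo=bar")); // true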

NOTE:

In case you want to match ANY substring of the URL, like ample in example.com, it becomes a little more complicated in terms of storage, because a string of length n has n - k + 1 substrings of length k, where n is the length of the URL and k is the length of the string to find.

According to this SO question the maximum reasonable length of a URL is 2000 characters. If you want to match arbitrary substrings, k varies from 1 to 2000, and summing n - k + 1 over all k gives n * (n + 1) / 2 hashes generated from a single URL. More precisely, for n = 2000 that is 2000 * 2001 / 2 = 2,001,000 different hashes.
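As a quick sanity check of that count (a hypothetical helper, same JavaScript as above):

    // Contiguous substrings of a string of length n:
    // sum over k = 1..n of (n - k + 1) = n * (n + 1) / 2.
    function substringCount(n) {
      return n * (n + 1) / 2;
    }

    console.log(substringCount(2000)); // 2001000 hashes for one 2000-char URL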

Upvotes: 1

Zorayr

Reputation: 24962

There are a few things you could do:

  • Don't process the URLs on the client side. JavaScript is going to be slow, especially if you have a lot of these URLs. You can create a REST API and pass in the URL for matching as a query parameter, e.g. domain.com/api/?url=.... Placing the heavy lifting and memory use on the server side also keeps the extension lightweight.
  • Bootstrap the URLs into RAM and don't read from the database on every request. Something like memcached could work perfectly in this case.
  • Once in RAM, a hash-table structure works best since you are doing simple exact matching. Whatever you do, avoid comparing the URL against every string in the list (see the sketch below).
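A rough sketch of the last two points; the bootstrap function and its data source are placeholders for however you actually load from your database:

    // Hypothetical bootstrap: load the partial URLs from your store once
    // at startup, not on every request.
    var urlTable = null;

    function bootstrap(partialUrls) {
      // A plain object used as a hash table: property lookup is hashed,
      // so no linear scan or per-entry string comparison is needed.
      urlTable = Object.create(null);
      partialUrls.forEach(function (u) {
        urlTable[u] = true;
      });
    }

    function isKnown(partialUrl) {
      return urlTable !== null && partialUrl in urlTable;
    }

    bootstrap(["example.com/blah.js", "blah.js"]); // once, at startup
    console.log(isKnown("blah.js")); // true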

If you follow these suggestions you should see a significant speedup. Hope this helps.

Upvotes: 0
