Kasper Zutterman
Kasper Zutterman

Reputation: 1

Space-efficient (de)serialisable URI lookup data structure

What data structures exist with the following properties better than a plain Set:

Context: We are trying to store a bunch (+1.000.000) of URI strings in memory with common parts, for example:

Upvotes: 0

Views: 30

Answers (1)

jfriend00
jfriend00

Reputation: 708036

The best I can think of is to reduce the input to the differing parts (remove the common parts) and then use a Set. The options for built-in fast lookup are properties on a plain object, Set or Map. Otherwise, you'd have to implement or find an implementation for your own hashed lookup which is unlikely to be faster than the built-in implementations.

So, based on the bit of information you provided, I'd choose a Set on the differing parts. For the two examples you provided, I would assume that http://marineregions.org/mrgid/ are the common parts and 52969?t=1638014401 and 57554?t=1638014414 are the different parts so we could put only the differing parts into the Set.

Here's an implementation:

class CommonSet extends Set {
    constructor(commonPrefix) {
        super();
        this.commonPrefix = commonPrefix;
    }
    add(url) {
        super.add(this._removePrefix(url));
    }
    has(url) {
        let token = this._removePrefix(url);
        return super.has(token);
    }
    _removePrefix(url) {
        if (url.indexOf(this.commonPrefix) !== 0) {
            throw new Error("url does not start with commonPrefix");
        }
        return url.slice(this.commonPrefix.length);
    }
    _addPrefix(token) {
        return this.commonPrefix + token;
    }
    // serialize existing data to JSON string
    serialize() {
        return JSON.stringify({ commonPrefix: this.commonPrefix, data: Array.from(this) })
    }
    // populate existing object from JSON string
    _deserialize(json) {
        const obj = JSON.parse(json);
        this.commonPrefix = obj.commonPrefix;
        if (!this.commonPrefix) {
            throw new Error("didn't find .commonPrefix property");
        }
        if (!Array.isArray(obj.data)) {
            throw new Error("didn't find .data array property");
        }
        for (let url of obj.data) {
            super.add(url);
        }
    }
    // create new CommonSet from previously serialized data
    static deserialize(json) {
        let s = new CommonSet();
        s._deserialize(json);
        return s;
    }
}

Notes about your requirements:

fast contains method

Uses the .has() method from the Set object. Internally hashed, 100% reliable

fast add method

Uses the .add() method from the Set Object. Internally hashed.

memory-efficient

Removes common parts of the URLs to save space.

(de)serialisable (memory-efficient)

This can be serialized to an array of data (plus the commonPrefix) for efficient storage and then deserialized from that. See .serialize() methods that serializes to JSON and the static `.deserialize() that creates a new CommonSet from previously serialized JSON. This serialization format (an array of the differing parts of the URL) is fairly efficient as there are no repeated property names of redundant objects boundaries. For pure space efficiency, you could abandon pure JSON and then quotation marks could be removed with an assumed string format. This would save two characters per URL, but complicate parsing when reading it back since it's no longer legit JSON so you'd have to build your own parser.

exact membership function -> no probabilistic data structure

The Set's .has() is exact. Hash collisions are handled internally in a reliable manner.

Other Notes:

You can quickly add support for other Set methods such as .delete() if required. .clear() will already work as is.

Upvotes: 1

Related Questions