Reputation: 1
What data structures exist with the following properties better than a plain Set:
Context: We are trying to store a bunch (+1.000.000) of URI strings in memory with common parts, for example:
Upvotes: 0
Views: 30
Reputation: 708036
The best I can think of is to reduce the input to the differing parts (remove the common parts) and then use a Set
. The options for built-in fast lookup are properties on a plain object, Set or Map. Otherwise, you'd have to implement or find an implementation for your own hashed lookup which is unlikely to be faster than the built-in implementations.
So, based on the bit of information you provided, I'd choose a Set on the differing parts. For the two examples you provided, I would assume that http://marineregions.org/mrgid/
are the common parts and 52969?t=1638014401
and 57554?t=1638014414
are the different parts so we could put only the differing parts into the Set.
Here's an implementation:
class CommonSet extends Set {
constructor(commonPrefix) {
super();
this.commonPrefix = commonPrefix;
}
add(url) {
super.add(this._removePrefix(url));
}
has(url) {
let token = this._removePrefix(url);
return super.has(token);
}
_removePrefix(url) {
if (url.indexOf(this.commonPrefix) !== 0) {
throw new Error("url does not start with commonPrefix");
}
return url.slice(this.commonPrefix.length);
}
_addPrefix(token) {
return this.commonPrefix + token;
}
// serialize existing data to JSON string
serialize() {
return JSON.stringify({ commonPrefix: this.commonPrefix, data: Array.from(this) })
}
// populate existing object from JSON string
_deserialize(json) {
const obj = JSON.parse(json);
this.commonPrefix = obj.commonPrefix;
if (!this.commonPrefix) {
throw new Error("didn't find .commonPrefix property");
}
if (!Array.isArray(obj.data)) {
throw new Error("didn't find .data array property");
}
for (let url of obj.data) {
super.add(url);
}
}
// create new CommonSet from previously serialized data
static deserialize(json) {
let s = new CommonSet();
s._deserialize(json);
return s;
}
}
Notes about your requirements:
fast contains method
Uses the .has()
method from the Set object. Internally hashed, 100% reliable
fast add method
Uses the .add()
method from the Set Object. Internally hashed.
memory-efficient
Removes common parts of the URLs to save space.
(de)serialisable (memory-efficient)
This can be serialized to an array of data (plus the commonPrefix) for efficient storage and then deserialized from that. See .serialize()
methods that serializes to JSON and the static `.deserialize() that creates a new CommonSet from previously serialized JSON. This serialization format (an array of the differing parts of the URL) is fairly efficient as there are no repeated property names of redundant objects boundaries. For pure space efficiency, you could abandon pure JSON and then quotation marks could be removed with an assumed string format. This would save two characters per URL, but complicate parsing when reading it back since it's no longer legit JSON so you'd have to build your own parser.
exact membership function -> no probabilistic data structure
The Set's .has()
is exact. Hash collisions are handled internally in a reliable manner.
Other Notes:
You can quickly add support for other Set methods such as .delete()
if required. .clear()
will already work as is.
Upvotes: 1