Reputation: 324
I have a question regarding index optimization in Postgres. I didn't find much help online, and I have struggled to get the answer myself by testing.
I have this table
CREATE TABLE "public"."crawls" (
"id" uuid NOT NULL DEFAULT uuid_generate_v4(),
"parent_id" uuid,
"group_id" timestamp,
"url" varchar(2083) NOT NULL,
"done" boolean,
PRIMARY KEY ("id")
);
CREATE UNIQUE INDEX "parentid_groupid_url" ON "public"."crawls" USING BTREE ("parent_id","group_id","url");
It's a URL store, used to compute a comprehensive list of URLs that are UNIQUE per parent and per group. I only need exact matches on this index. This means the same parent_id can have the same URL multiple times, as long as the group_id is different.
The table contains hundreds of millions of URLs and is mainly used for writes; the UNIQUE index is for deduplication.
UPDATE crawls
SET
done = TRUE
WHERE
url = $1 AND
parent_id = $2 AND
group_id = $3
INSERT
INTO crawls (
url,
parent_id,
group_id
) VALUES
('long urls', uuid, date)
ON CONFLICT (parent_id, group_id, url) DO NOTHING;
Currently the performance is okay but could be better, and the index is larger than the table itself because of the url column.
I was wondering how I could improve the size and/or the performance (both if possible)?
I thought about adding a new column holding a hash (md5, sha1) of the URL and using it in the index instead of the URL itself, so that the key length is consistent, smaller, and perhaps faster for Postgres, but I didn't find any help on that. I'm not sure it's efficient because of the "randomness" of a hash, and I have a hard time testing this hypothesis due to the size of the table and the time it takes to build the index on my prod.
Thanks,
Upvotes: 0
Views: 266
Reputation: 44167
I thought about using a new column to hash (md5, sha1) the URL and use it in the index instead of the URL, so that the length is consistent, smaller and may be faster for Postgres
create unique index on crawls (parent_id, group_id, md5(url));
This will automatically enforce uniqueness (and will also reject md5 collisions, i.e. rows whose full URLs are distinct but hash to the same value--but in the absence of malice the chance of that occurring is tiny). However, the index will not automatically be used for fast look-ups; you have to adapt your queries so it can be used:
WHERE
md5(url) = md5($1) AND
parent_id = $2 AND
group_id = $3
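The same applies to the INSERT from the question: with an expression index, the ON CONFLICT target must repeat the index expressions so Postgres can infer the arbiter index. A sketch using the question's column names:

```sql
INSERT INTO crawls (url, parent_id, group_id)
VALUES ($1, $2, $3)
ON CONFLICT (parent_id, group_id, md5(url)) DO NOTHING;
```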
You could save more space by using a representation shorter than hex:
create unique index on crawls (parent_id, group_id, decode(md5(url), 'hex'));
But that will make it even more cumbersome to use.
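To see why this saves space: in Postgres, md5() returns a 32-character hex string, while decode(..., 'hex') converts it to a 16-byte bytea, halving the key width. A quick check (illustrative URL only):

```sql
SELECT length(md5(url)) AS hex_chars,                       -- 32
       octet_length(decode(md5(url), 'hex')) AS raw_bytes   -- 16
FROM (VALUES ('https://example.com/some/long/path')) AS t(url);
```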
I'm not sure it's efficient because of the "randomness" of a hash
It depends entirely on your usage pattern and your data distribution. If you commonly access a series of records with the same parent_id and group_id and adjacent urls, and the number of records with same parent_id and group_id is large, then hashing the urls could decrease the effectiveness of caching.
I have a hard time testing this hypothesis due to the size and the time to build the index on my prod.
Not having a test environment is working with both hands tied behind your back.
Upvotes: 1