Reputation: 324
I have a question regarding index optimization in Postgres. I didn't find much help online, and I have struggled to get the answer myself by testing.
I have this table
CREATE TABLE "public"."crawls" (
"id" uuid NOT NULL DEFAULT uuid_generate_v4(),
"parent_id" uuid,
"group_id" timestamp,
"url" varchar(2083) NOT NULL,
"done" boolean,
PRIMARY KEY ("id")
);
CREATE UNIQUE INDEX "parentid_groupid_url" ON "public"."crawls" USING BTREE ("parent_id","group_id","url");
It's a URL store, used to compute a comprehensive list of URLs that are UNIQUE per parent and per group. I only need exact matches on this index. This means the same parent_id can have the same URL multiple times, as long as the group_id is different.
The table contains hundreds of millions of URLs and is mainly used for writes; the UNIQUE index is for deduplication.
UPDATE crawls
SET
done = TRUE
WHERE
url = $1 AND
parent_id = $2 AND
group_id = $3
INSERT
INTO crawls (
url,
parent_id,
group_id
) VALUES
('long urls', uuid, date)
ON CONFLICT (parent_id, group_id, url) DO NOTHING;
Currently the performance is okay but could be better, and the index is larger than the table itself because of the url column.
I was wondering how I could improve the size and/or the performance (both if possible)?
I thought about adding a new column holding a hash (md5, sha1) of the URL and using it in the index instead of the URL itself, so that the key length is consistent, smaller, and perhaps faster for Postgres, but I didn't find any help on that. I'm not sure it's efficient because of the "randomness" of a hash, and I have a hard time testing this hypothesis due to the size of the table and the time it takes to build the index on my prod.
Thanks,
Upvotes: 0
Views: 266
Reputation: 44167
I thought about using a new column to hash (md5, sha1) the URL and use it in the index instead of the URL, so that the length is consistent, smaller and may be faster for Postgres
create unique index on crawls (parent_id, group_id, md5(url));
This will automatically enforce uniqueness (and will also reject md5 collisions, i.e. rows whose full URLs are distinct but hash to the same value--but in the absence of malice the chance of that occurring is tiny). However, the index will not automatically be used for fast look-ups; you have to adapt your queries so it can be used:
WHERE
md5(url) = md5($1) AND
parent_id = $2 AND
group_id = $3
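The same applies to the INSERT from the question: with an expression index, the ON CONFLICT target must repeat the index expressions so Postgres can infer the arbiter index. A sketch using the question's column names:

```sql
INSERT INTO crawls (url, parent_id, group_id)
VALUES ($1, $2, $3)
ON CONFLICT (parent_id, group_id, md5(url)) DO NOTHING;
```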
You could save more space by using a representation shorter than hex:
create unique index on crawls (parent_id, group_id, decode(md5(url), 'hex'));
But that will make it even more cumbersome to use.
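To see why this saves space: in Postgres, md5() returns a 32-character hex string, while decode(..., 'hex') converts it to a 16-byte bytea, halving the key width. A quick check (illustrative URL only):

```sql
SELECT length(md5(url)) AS hex_chars,                       -- 32
       octet_length(decode(md5(url), 'hex')) AS raw_bytes   -- 16
FROM (VALUES ('https://example.com/some/long/path')) AS t(url);
```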
I'm not sure it's efficient because of the "randomness" of a hash
It depends entirely on your usage pattern and your data distribution. If you commonly access a series of records with the same parent_id and group_id and adjacent urls, and the number of records with same parent_id and group_id is large, then hashing the urls could decrease the effectiveness of caching.
I have a hard time testing this hypothesis due to the size and the time to build the index on my prod.
Not having a test environment is working with both hands tied behind your back.
Upvotes: 1